Search Algorithms For Ideal Optimal Mobile Phone Feature Sets by Lokuge Nadisha de Silva MSc Advanced Software Engineering 2006 Department of Computer Science King’s College London Supervisor: Professor Mark Harman August 26, 2006

To My Loving Parents......

Abstract

With the evolution of the field of software engineering, development has shifted towards component-based activity rather than the construction of single software systems. However, the managers involved in such projects face the issue of determining which components to include in a particular release while balancing the trade-offs. Release planning is a vital part of any kind of incremental product development, and release decisions are therefore a significant requirement for successful software engineering project management. Systematic and comprehensive release planning, based on meaningful data and estimates combined with a firm, reliable methodology, would contribute towards solving the problems the managers face effectively. This thesis attempts to solve the problem of finding optimal feature subsets for release planning by applying search based software engineering. The objective of this thesis is to design and implement meta-heuristic search algorithms, namely Hill Climbing, Simulated Annealing, the Greedy Algorithm and Genetic Algorithms, and to evaluate them in terms of their performance, effort and robustness in producing results, as well as the speed at which they produce them. In order to achieve this objective, a mobile phone is modelled as a component-based software system, with its features as software components, to which the algorithms are applied. The project is carried out using real-world data specified by Motorola UK for thirty-five mobile phone features.

Acknowledgements

Professor Mark Harman, who made me focus from the very beginning, provided valuable advice and guidance in completing this project, and made arrangements for me to be involved in the meetings King's had with Motorola. Paul Baker of Motorola UK, who gave his valuable time and information and, most importantly, offered the opportunity to join the meetings Motorola had with King's. Shin Yoo and Alexandros Skaliotis, for all the technical support. Nipuna, who gave me immense moral support and courage and tolerated me talking about search algorithms all the time, which is not the favourite topic for a finance student. Last but by no means least, my parents, for all the sacrifices made and for being beside me at all times giving me courage, moral support and advice, without which I would not have achieved my goals. If not for them, none of this would have been possible.

Contents

1 Introduction
  1.1 Component Based Software Engineering (CBSE)
  1.2 Scenario
  1.3 Motivation
  1.4 Scope
  1.5 Thesis Outline

2 Background and Related Work
  2.1 Meta-Heuristic Search Based Techniques/Algorithms
    2.1.1 Hill Climbing
    2.1.2 Simulated Annealing (SA)
    2.1.3 Genetic Algorithm (GA)
    2.1.4 Greedy Algorithm
  2.2 Multi-Objective Aspect of the Optimisation Problem
    2.2.1 Pareto Optimisation
    2.2.2 Multi-Objective Evolutionary Algorithms
  2.3 Chapter Summary

3 Problem Specification of the Single-Objective Scenario
  3.1 Mobile Phone as a Component-Based Software System
  3.2 Chapter Summary

4 Design & Implementation Overview of the Single-Objective Scenario
  4.1 Approach
    4.1.1 Representation
    4.1.2 Fitness Function
  4.2 System Design
  4.3 Database Design
  4.4 Detailed Design and Implementation
    4.4.1 Data Handling
    4.4.2 Heuristic Search Algorithms
    4.4.3 Specification of the Algorithm Required
    4.4.4 Solution Handling
  4.5 Chapter Summary

5 Evaluation
  5.1 Results and Analysis
    5.1.1 Performance Analysis
    5.1.2 Effort Analysis
    5.1.3 Robustness Analysis
  5.2 Chapter Summary

6 Extension from Single-Objective to Multi-Objective Scenario
  6.1 Problem Specification of the Multi-Objective Scenario
  6.2 Design & Implementation Overview of the Multi-Objective Scenario
    6.2.1 Representation
    6.2.2 Fitness Function
    6.2.3 Detailed Design and Implementation
  6.3 Evaluation
  6.4 Chapter Summary

7 Summary of Results
  7.1 Summary of Findings
  7.2 Chapter Summary

8 Conclusion
  8.1 Achievements
  8.2 Future Work
  8.3 Chapter Summary

Bibliography

A Source Code

List of Figures

2.1 Hill Climbing
2.2 Selection process in a standard GA
2.3 Roulette Wheel Selection
2.4 Stochastic Universal Sampling
2.5 Uniform Crossover
2.6 Pareto front for a bi-objective optimisation problem
4.1 Component diagram for the optimisation problem
4.2 System design
4.3 Pseudocode of the Greedy Algorithm
4.4 Pseudocode of Steepest Ascent Hill Climbing
4.5 Pseudocode of Simulated Annealing
4.6 Skeleton of the GA designed
4.7 Pseudocode of Roulette Wheel
4.8 Pseudocode of Stochastic Universal Sampler
5.1 Performance analysis of Hill Climbing
5.2 Performance analysis of Genetic Algorithms
5.3 Performance analysis of all the algorithms
5.4 Performance analysis of Hill Climbing over the filtered budgets
5.5 Performance analysis of Genetic Algorithms over the filtered budgets
5.6 Performance analysis of all the algorithms for the filtered budget bounds
5.7 Performance analysis in terms of size for all budget bounds
5.8 Performance analysis in terms of size for the filtered budget bounds

List of Tables

3.1 Sample dataset for 4 features
4.1 Database design
4.2 9 Variations of GAs implemented
5.1 Effort analysis
5.2 Robustness analysis of the results generated - 1
5.3 Robustness analysis of the results generated - 2
5.4 Robustness analysis of the speeds of generating solutions - 1
5.5 Robustness analysis of the speeds of generating solutions - 2
6.1 Performance analysis of Pareto optimal fronts

Chapter 1

Introduction

1.1 Component Based Software Engineering (CBSE)

As the field of software engineering evolves, it is adapting more to component-based activity instead of the development of single software systems as was done earlier. Component-based systems are built from a set of components in order to meet a certain set of requirements. In an ideal engineering scenario, a component can be considered a self-contained part or subsystem which can be used as a building block in the design of a larger, complex system. In the field of Component-Based Software Engineering there exist many definitions for the term component; one of them [10] is an executable unit whose deployment and composition can be performed at run time. Component-based activity has attracted great interest in the field of software engineering and has been successful in many engineering and application domains. However, the managers of these projects face the issue of determining which set of components to include in a particular version/release while balancing the trade-offs. Software Release Planning (RP) plays a significant role in solving these problems and affects the success of such software engineering project management. The following scenario illustrates how poor RP can result in unsatisfied customers, time and budget overruns, and a loss of market share.

1.2 Scenario

Suppose a development team is assigned the task of developing a software system which consists of many functionalities/features. The system will be used by various categories of users, prioritised by the ranks they hold. With limited time and resources in hand, the development process must take place in iterations, releasing a particular version of the system at each iteration. The overall goal of RP in this scenario is to determine the set of features to be included in a release while balancing constraints, maximising the desirability of features and minimising resources. This primarily involves component selection and prioritisation: component selection is deciding on a subset while balancing the constraints, whereas prioritisation is deciding on the order in which each feature should be considered during selection. Key tasks which result in feasible and valuable release planning processes have been identified [10]. Some of them are:
• Feature Elicitation: Features need to be identified and described in an unambiguous way so that they are understood correctly.


• Planning Objectives: There should exist an approximate definition of objectives for a release. This could be targeted at maximising the weighted sum of stakeholder (user) satisfaction and the value of releases.
• Stakeholder Involvement: Different points of view need to be considered, as they will impact the process of prioritisation. These could be the views of the users or of experienced experts in the business.
• Constraints: The most critical resources required (cost) for the implementation of each feature need to be considered in order to arrive at a feasible plan.
• Prioritisation of Features: A voting scheme is needed to allow the stakeholders to express their preferences on feature prioritisation.
• Prioritisation of Stakeholders: Stakeholders too need to be prioritised, in order to differentiate between key, secondary and unimportant levels of ranked users. This way, the prioritisation decisions made by the users can be weighted by the priority of the user who made them.
Once these tasks are completed, release planning can be addressed as an optimisation problem, in which an optimised set of features is selected balancing the constraints and trade-offs, taking into consideration the results obtained from the above tasks. However, the process of RP following the above tasks can be observed to have weaknesses. They are as follows:
• RP has not been undertaken based on sound data, models, methodology and experience, even when the planning of several hundred features took place; thus the results produced were not optimal.
• Planning takes place without much consideration of the resources available. Hence, the process results in solutions that may exceed the available budget.
• Stakeholder involvement in RP is also insufficient, and therefore the process fails to address stakeholder concerns appropriately.
• The process of RP consumes a significant amount of time for the project managers and the other stakeholders involved in the meetings.
All in all, RP is essential to any kind of incremental product development, and decisions taken during RP contribute towards successful software project management. However, it is important to note that systematic and comprehensive RP, based on meaningful data and estimates together with a reliable, well-defined methodology, will provide added value and overcome almost all of these weaknesses.

1.3 Motivation

As illustrated in the scenario, software engineers and managers encounter problems associated with balancing constraints and trade-offs where RP is concerned. The scenario also emphasised the significance of a systematic and comprehensive RP process for obtaining better solutions. Yet obtaining the perfect solution is either impossible or impractical. To address similar problems, software engineers are now considering near optimal solutions, or solutions which fall within a particular acceptable range [8]. These factors make robust meta-heuristic search based optimisation techniques applicable to the process of RP as a means of producing reliable solutions [8]. The application of such techniques can result in time savings due to the automatic generation of solutions; it brings flexibility to the process of RP, thereby helping it deal with changes to estimated resources, budgets and objectives; and, last but not least, it will provide solutions which match customer needs. Meta-heuristic search based techniques can, by definition, be considered technologies used to deal with an optimisation process. In other words, these tools can be used to find optimal, or rather near optimal, solutions in constraint-based situations [7]. There exist various techniques which fall under this category, such as Hill Climbing, Simulated Annealing and Genetic Algorithms [7]. One of these algorithms may perform better than another depending on the situation in which it is applied; this can be caused by the variety of data models which exist across different situations. This signifies the importance of experimenting with the behaviour of a set of heuristic search techniques when applied to the process of RP for a component-based system.

1.4 Scope

The primary aim of the project is to design and implement heuristic search algorithms, namely variations of Hill Climbing, Genetic Algorithms, Simulated Annealing and the Greedy Algorithm, and to evaluate these algorithms when applied to RP. In order to support this objective, a mobile phone is modelled as a component-based software system, with its features as software components. Each algorithm is applied to the process of RP for a particular version/release of a mobile phone for varying budgets. The project is carried out using real-world data specified by Motorola UK for thirty-five mobile phone features. Initially, the evaluation is carried out for an optimisation problem which considers a single objective. However, the optimisation problem also has a multi-objective aspect, which requires the simultaneous optimisation of more than one objective. The thesis therefore also looks at the possibilities of solving a multi-objective optimisation problem, as an extension to the initial attempt.

1.5 Thesis Outline

In this chapter, the motivation for applying meta-heuristic search techniques to RP was presented, and the aims and objectives of the project were explained. The remainder of this thesis is organised as follows.
Chapter 2 Background and Related Work: Provides an overview of the existing meta-heuristic search techniques.
Chapter 3 Problem Specification of the Single-Objective Scenario: Explains the specification of the problem under review, which focuses on a single-objective scenario.
Chapter 4 Design & Implementation Overview of the Single-Objective Scenario: Introduces the techniques used in the design and implementation of the system. The design approaches undertaken, together with the system and database design for the overall system, are described and justified, and a detailed overview of all the components of the system is given.
Chapter 5 Evaluation: Provides detailed graphical representations of the evaluation of the results obtained, together with a critical analysis of the evaluations made.


Chapter 6 Extension from Single-Objective to Multi-Objective Scenario: An extension to the current project, which deals with the single-objective scenario. This chapter gives an insight into the problem specification, design & implementation and evaluation of the work carried out to solve the multi-objective aspect of the optimisation problem.
Chapter 7 Summary of Results: Presents a summary of the findings of Chapters 5 and 6.
Chapter 8 Conclusion: Highlights the key achievements of the project and discusses future work, as well as the conclusions drawn from the insights gained during the project.

Chapter 2

Background and Related Work

2.1 Meta-Heuristic Search Based Techniques/Algorithms

Optimisation is a process which determines a set of parameters for which a certain objective function attains a maximum or a minimum value. Meta-heuristic search based techniques are, by definition, technologies used to deal with an optimisation process involving many parameters with conflicting constraints. Thus, they can be used to find optimal, or rather near optimal, solutions in constraint-based situations [7]. There are two key factors which need to be considered when approaching an optimisation situation as a search-based problem [8], namely:
• Representation/Coding: Representation of the problem under review is essential to solving a search-based problem effectively. Representation can be described as encoding the parameters which contribute towards a candidate optimised solution. Floating point numbers and binary code are frequently used in existing applications of optimisation problems [8].
• Fitness/Objective Function: Once the problem has been coded, it is essential to assign a weighting or measure to each component/parameter, in order to be able to measure each individual solution in the set of possible solutions (the search space) on an ordinal scale.
Once a representation and fitness function have been decided upon, the problem can be subjected to any one of the meta-heuristic search techniques, which are discussed below.
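To make these two ingredients concrete for the kind of feature-subset problem studied later in the thesis, a candidate release can be encoded as a bit string (bit i set means feature i is included), with a fitness that rewards total feature value and rejects over-budget selections. This is only an illustrative sketch: the values, costs and budget below are invented placeholders, not the Motorola data.

```python
# Sketch: binary representation and fitness function for feature-subset
# selection. All numbers here are invented placeholders.

values = [10, 40, 30, 50]   # desirability of each feature (hypothetical)
costs  = [ 5, 40, 25, 35]   # implementation cost of each feature (hypothetical)
BUDGET = 65

def fitness(candidate):
    """Total value of the selected features; infeasible (over-budget)
    candidates score 0 so the search avoids them."""
    total_value = sum(v for v, bit in zip(values, candidate) if bit)
    total_cost  = sum(c for c, bit in zip(costs,  candidate) if bit)
    return total_value if total_cost <= BUDGET else 0

print(fitness([1, 0, 1, 1]))  # cost 65 <= 65, value 90
print(fitness([1, 1, 1, 1]))  # cost 105 > 65, infeasible -> 0
```

A penalty of zero is the simplest way to handle the budget constraint; a graded penalty proportional to the overspend would be another common choice.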

2.1.1 Hill Climbing

Hill Climbing [7], as the name suggests, begins its search from a random point in the search space and climbs towards a higher point, with the intention of reaching the top of the hill. The algorithm proceeds from a randomly chosen point and compares it with its neighbours¹, making a fitter neighbour (i.e. a neighbour with a higher fitness score) the current state. This process continues until no fitter neighbour is found, at which point the process terminates and the current state at the terminating instance is regarded as the maximum.

¹A neighbour is another candidate solution, generated by exchanging a parameter/gene in the current solution.


Figure 2.1: A multi-modal search space, consisting of many local maxima and a global maximum, as may be encountered in Hill Climbing.

This thesis looks at several variations of the Hill Climbing concept. For instance, one algorithm (Next Ascent Hill Climbing) selects the first neighbour found to be fitter than the current individual in order to climb the hill, while another (Steepest Ascent Hill Climbing) considers all of its neighbours and moves to the neighbour with the best fitness score. Although this technique climbs a hill towards an optimal solution, it is not guaranteed to be climbing towards the global maximum. Instead, it may terminate at the first peak (local maximum) it meets, even if that peak is not the highest in the space. This is more likely to happen when the fitness function of the problem is multimodal (a function with many peaks) rather than unimodal (a function with one peak). Therefore, there is another variation on this theme, Iterative Hill Climbing, which overcomes this problem. Iterative Hill Climbing works by repeating either of the Hill Climbing methods mentioned above, starting from a different randomly chosen point each time. This enables the system to discover more than one local maximum and hence to decide on the global maximum.
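The two variants just described can be sketched over bit strings with the single-bit-flip neighbourhood; the fitness function here is a toy stand-in (counting ones), not the one used in the thesis.

```python
import random

def next_ascent_hill_climb(n_bits, fitness, seed=None):
    """Climb from a random bit string: move to the FIRST fitter neighbour
    found (bit-flip neighbourhood); stop at a local maximum."""
    rng = random.Random(seed)
    current = [rng.randint(0, 1) for _ in range(n_bits)]
    improved = True
    while improved:
        improved = False
        for i in range(n_bits):
            neighbour = current[:]
            neighbour[i] ^= 1              # flip one bit to get a neighbour
            if fitness(neighbour) > fitness(current):
                current = neighbour        # first fitter neighbour wins
                improved = True
                break
    return current

def iterative_hill_climb(n_bits, fitness, restarts=10):
    """Iterative Hill Climbing: restart from several random points and keep
    the best result, reducing the risk of a poor local maximum."""
    return max((next_ascent_hill_climb(n_bits, fitness) for _ in range(restarts)),
               key=fitness)

ones = lambda bits: sum(bits)              # toy unimodal fitness: count of 1s
print(iterative_hill_climb(8, ones))       # -> [1, 1, 1, 1, 1, 1, 1, 1]
```

On a unimodal fitness like this one a single climb already finds the optimum; the restarts only matter on multimodal landscapes.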

2.1.2 Simulated Annealing (SA)

Simulated Annealing [2] is also based on a neighbourhood structure, but additionally on a set of transition rules. Like Hill Climbing, Simulated Annealing commences from a random point in the search space and moves towards the optimal solution. However, it does not necessarily move towards a fitter neighbour on its way to the optimal solution; this technique even allows moves down the hill. In other words, the system may move to a new state even if it is worse (less fit) than the current state, depending on its probability of being accepted. The probability of


being accepted,

p = e^(−ΔE/T),

is a function of the change in fitness score ΔE and a parameter T, and happens to be of a similar form to the Maxwell–Boltzmann distribution governing the distribution of energies of molecules in a gas, where E and T relate to energy and temperature respectively. If a neighbour is found which is fitter than the current state, the probability of acceptance becomes 1 (i.e. the move to the neighbouring state is always performed); otherwise the probability is given by the function p and must hold a value greater than a specified threshold for the move to be accepted. The probability function depends strongly on the values of T and ΔE. At the start of the algorithm, T holds a high value, which results in a high probability of accepting even very unfavourable moves. During the search, the value of T is slowly reduced by a cooling function, which in turn reduces the probability of accepting unfavourable moves. The cooling schedule is the set of parameters that determines the decrement of the temperature T at each stage of the search. The parameters include the initial temperature, the stop criterion, the temperature decrement between successive stages, and the number of transitions at each temperature value. The cooling schedule is crucial to the success of the optimisation process. Thus, there are two key ingredients which need particular attention in Simulated Annealing:
1. The definition of the neighbourhood
2. The specification of the cooling schedule
This special characteristic of being able to go down the hill prevents Simulated Annealing from becoming stuck in local maxima, as Hill Climbing can be.
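The acceptance rule and cooling schedule can be sketched as follows. The geometric schedule (multiplying T by a constant alpha at each stage) and the parameter values are common illustrative choices, assumed here rather than taken from the thesis.

```python
import math
import random

def simulated_annealing(initial, fitness, neighbour,
                        t_start=100.0, t_min=0.01, alpha=0.95, moves_per_t=50):
    """Maximise `fitness`: a worse neighbour (fitness drop dE > 0) is
    accepted with probability p = exp(-dE / T); T shrinks geometrically."""
    rng = random.Random(0)                 # fixed seed for reproducibility
    current = best = initial
    t = t_start
    while t > t_min:
        for _ in range(moves_per_t):
            candidate = neighbour(current, rng)
            d_e = fitness(current) - fitness(candidate)   # > 0 if worse
            if d_e <= 0 or rng.random() < math.exp(-d_e / t):
                current = candidate        # accept the (possibly downhill) move
            if fitness(current) > fitness(best):
                best = current             # remember the best state seen
        t *= alpha                         # cooling step
    return best

def flip_one_bit(bits, rng):
    """Neighbourhood: flip a single randomly chosen bit."""
    out = bits[:]
    out[rng.randrange(len(out))] ^= 1
    return out

print(simulated_annealing([0] * 8, sum, flip_one_bit))  # onemax -> [1]*8
```

At high T the loop behaves like a random walk; as T falls it degenerates into hill climbing, which is why the schedule parameters matter so much.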

2.1.3 Genetic Algorithm (GA)

The Genetic Algorithm [4] is a search technique based on the genetic processes of biological organisms. The algorithm evolves candidate solutions towards better solutions, analogously to natural populations evolving over generations according to the principles of "natural selection" and "survival of the fittest". The process of evolution starts from the initial generation: a set of randomly generated individuals. Once an initial generation has been created, the process evaluates the current population by fitness value and selects multiple highly fit individuals for reproduction through cross-breeding, in order to form a new generation of offspring. As the new population is produced by mating the best individuals selected from the current generation, the offspring inherit a higher proportion of the good characteristics possessed by their highly fit parents. In this manner, the least fit members of each generation die out, as they are less likely to be selected for reproduction, while good characteristics spread throughout the population over many generations. In this way, the Genetic Algorithm starts with a diverse search space and restricts itself to a more promising region as it converges towards the optimal solution. In evolving candidate solutions towards the optimal solution, the Genetic Algorithm uses several operators, such as selection, crossover, mutation and replacement. The operators themselves have their own variations, which are discussed below.

Selection

The process of going from the current population to the next population constitutes one generation in the execution of a genetic algorithm. The 'current' population for the first generation is the initial population.
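The overall evolutionary cycle described above can be sketched as a loop over selection, crossover and mutation. The operator choices here (tournament selection, one-point crossover, bit-flip mutation) are illustrative stand-ins, not the specific variants the thesis implements and compares.

```python
import random

def genetic_algorithm(n_bits, fitness, pop_size=20, generations=50,
                      p_crossover=0.8, p_mutation=0.02, seed=0):
    """Evolve a population of bit strings towards higher fitness."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]

    def tournament():
        # Pick the fitter of two randomly chosen parents.
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        nxt = []
        while len(nxt) < pop_size:
            p1, p2 = tournament(), tournament()
            if rng.random() < p_crossover:          # one-point crossover
                cut = rng.randrange(1, n_bits)
                p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            for child in (p1, p2):                  # bit-flip mutation
                nxt.append([b ^ 1 if rng.random() < p_mutation else b
                            for b in child])
        pop = nxt[:pop_size]                        # generational replacement
    return max(pop, key=fitness)

print(genetic_algorithm(10, sum))  # onemax: should be at or near all ones
```

Each pass through the outer loop is one generation: the intermediate population is drawn by tournaments, recombined, mutated, and wholly replaces its parents.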


Figure 2.2: Selection process in a standard Genetic Algorithm, where fit individuals are selected from the current generation t, placed into an intermediate generation, and recombined to produce offspring that are included in the next generation t+1.

Selection is an operator applied to the current population in order to create an intermediate population. The intermediate population is, in other words, the mating pool: the set of parents on which recombination is applied in order to create a new population. Selection plays a vital role in a GA, as it is responsible for selecting the fittest individuals from the current population to give birth to the offspring: a new, even fitter population. If selection does not take place successfully, the GA will not be able to reach the optimum solution. Described below are the three types of selection method [1] used in this thesis.
• Roulette Wheel Selection
As mentioned earlier, the selection process evaluates the fitness of the current population via the probability of each of its individuals being selected. The probability of an individual in the population being accepted can be considered as

p = (fitness score of the individual) / (total fitness score of the population)

In this selection method, the individuals are mapped onto adjacent segments of a line, such that the size of each segment is proportional to the probability of the corresponding individual being accepted. A random number is then generated, and the individual whose segment spans this number is selected. This process repeats until the desired number of individuals has been selected for the intermediate population.
Example: There are 11 individuals in a population with the fitness scores 2.0, 1.8, 1.6, 1.4, 1.2, 1.0, 0.8, 0.6, 0.4, 0.2 and 0.0. Hence their probabilities of being accepted are 0.18, 0.16, 0.15, 0.13, 0.11, 0.09, 0.07, 0.06, 0.03, 0.02 and 0.0 respectively. Suppose the number of individuals

needed for the intermediate population is 6. Figure 2.3 represents the process for six randomly generated numbers, 0.81, 0.32, 0.96, 0.01, 0.65 and 0.42; the selection process therefore results in a mating population consisting of individuals 1, 2, 3, 5, 6 and 9. Clearly, the first individual is the highest-scored individual in the population and hence occupies a large interval on the line, while the 11th individual has a fitness score of 0.0 and therefore has no chance of being selected.

Figure 2.3: Roulette Wheel Selection: individuals are mapped onto adjacent segments of a line such that the size of each segment is proportional to the probability of the corresponding individual being accepted. The figure represents the process for the six randomly generated numbers 0.81, 0.32, 0.96, 0.01, 0.65 and 0.42, resulting in a mating population consisting of individuals 1, 2, 3, 5, 6 and 9.

• Stochastic Universal Sampling
As in Roulette Wheel Selection, the individuals are mapped onto adjacent segments of a line. What makes this method different is that equally spaced pointers are placed on the line, as many as the number of individuals to be selected. As they are equally spaced, the distance between each of them is 1/(number of pointers). For instance, if the number of individuals needed is 6, then 6 pointers are placed on the line with a distance of 1/6 between them. The individuals whose segments span these pointers are then selected. Figure 2.4 shows the process of Stochastic Universal Sampling for 6 individuals to be selected; as the figure illustrates, the selection results in an intermediate population with individuals 1, 2, 3, 4, 6 and 8.

Figure 2.4: Stochastic Universal Sampling: individuals are mapped onto adjacent segments of a line, and equally spaced pointers, as many as the number of individuals to be selected, are placed on the line. The figure shows the process for 6 individuals, resulting in an intermediate population with individuals 1, 2, 3, 4, 6 and 8.
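The two line-segment schemes can be sketched with a shared cumulative-sum line; the fitness scores below are the eleven from the worked example, and the helper names are my own, not the thesis's.

```python
import random
from bisect import bisect_left
from itertools import accumulate

def roulette_wheel(fitnesses, n, rng=None):
    """Spin n times: each random point lands in a segment whose width is
    proportional to that individual's fitness; return selected indices."""
    rng = rng or random.Random()
    cum = list(accumulate(fitnesses))        # segment boundaries on the line
    return [bisect_left(cum, rng.random() * cum[-1]) for _ in range(n)]

def stochastic_universal_sampling(fitnesses, n, rng=None):
    """One spin, then n equally spaced pointers (total/n apart)."""
    rng = rng or random.Random()
    cum = list(accumulate(fitnesses))
    step = cum[-1] / n
    start = rng.random() * step              # single random offset
    return [bisect_left(cum, start + i * step) for i in range(n)]

# The eleven fitness scores from the worked example above.
scores = [2.0, 1.8, 1.6, 1.4, 1.2, 1.0, 0.8, 0.6, 0.4, 0.2, 0.0]
print(sorted(roulette_wheel(scores, 6, random.Random(1))))
print(sorted(stochastic_universal_sampling(scores, 6, random.Random(1))))
```

Note that the zero-fitness individual occupies a zero-width segment, so neither scheme can ever select it, matching the observation in the example.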


• Tournament Selection
In Tournament Selection, two individuals are chosen randomly from the current population and the better of the two is selected as a parent. This process is repeated as often as the number of individuals which must be chosen for the intermediate population.

Recombination/Crossover
Once the selection process has been carried out and the intermediate population produced, recombination takes place. Recombination can be considered as the process of creating the new population from the intermediate population, by recombining parent individuals using one of the existing crossover methods. The crossover is applied to each pair of individuals in the mating pool, from the beginning to the end. Explained below are the three types of crossover techniques used in this thesis. (The binary strings 1101001100101101 and yxyyxyxxyyyxyxxy, where the values 0 and 1 are denoted by x and y, represent individuals in a mating pool and are used in the following explanations.)

• One-point Crossover
In this method, a single crossover point is picked and the two strings are split at this point as 11010 * 01100101101 and yxyyx * yxxyyyxyxxy. They are then recombined by swapping the fragments between the two parents, producing the offspring 11010yxxyyyxyxxy and yxyyx01100101101.

• Two-point Crossover
In this method, two crossover points are picked and the two strings are split at these points as 11010 * 01100 * 101101 and yxyyx * yxxyy * yxyxxy. They are then recombined by swapping the fragments between the two parents, producing the offspring 11010yxxyy101101 and yxyyx01100yxyxxy.

• Uniform Crossover
The uniform crossover technique differs from the other techniques in several ways. One is that it produces a single offspring from two parents, as opposed to the other techniques which produce two offspring. There are also no crossover points associated with it.
The offspring is created by copying a corresponding gene from one parent or the other, depending on a randomly generated crossover mask. A crossover mask is a randomly generated binary string of the same format as the representation of the parent individuals. The offspring is generated by copying a gene from the first parent wherever there is a 1 in the corresponding gene of the uniform mask, and from the second parent wherever there is a 0. This process is illustrated in Figure 2.5. A new uniform mask is randomly generated each time a new pair of parents in the mating pool is used to produce an offspring. When a similar pair of parents happens to be recombined using a one- or two-point crossover, the segments exchanged will be identical, leading the offspring to be identical to their parents. This disadvantage does not exist in the case of uniform crossover; hence uniform crossover can be considered a robust technique.
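The masked copying described above can be sketched in Java. This is a minimal illustration rather than the thesis code; the class and method names (UniformCrossoverDemo, uniformCrossover) are hypothetical, and chromosomes are held as int arrays of 0s and 1s:

```java
import java.util.Random;

public class UniformCrossoverDemo {
    // Builds one offspring: gene i comes from parent1 when mask[i] == 1,
    // otherwise from parent2.
    static int[] uniformCrossover(int[] parent1, int[] parent2, int[] mask) {
        int[] offspring = new int[parent1.length];
        for (int i = 0; i < parent1.length; i++) {
            offspring[i] = (mask[i] == 1) ? parent1[i] : parent2[i];
        }
        return offspring;
    }

    // Convenience overload that draws the mask at random, matching the
    // text's description of a fresh mask per pair of parents.
    static int[] uniformCrossover(int[] parent1, int[] parent2, Random rnd) {
        int[] mask = new int[parent1.length];
        for (int i = 0; i < mask.length; i++) mask[i] = rnd.nextInt(2);
        return uniformCrossover(parent1, parent2, mask);
    }
}
```

Passing the mask explicitly makes the operation repeatable for testing, while the random overload reflects the behaviour described in the text.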


Figure 2.5: Uniform Crossover, where the offspring is generated by copying a gene from the first parent where there is a 1 in the corresponding gene of the uniform mask, and from the second parent where there is a 0 in the corresponding gene of the mask.

Mutation
Once recombination has taken place, mutation operators are applied to the offspring. An individual in the new population is mutated when the probability of being mutated generated for that individual is less than a specified default probability. If the mutation probability for the particular individual is within the range, mutation of that individual takes place by changing/mutating the current value of a random gene in the respective individual. ('Mutating' a gene means changing its value; if the solution is encoded in binary, the value is changed to 1 if the current value is 0, and vice versa.)

Evaluation
Once selection, recombination and mutation have taken place, evaluation is the next operation. As recombination will most of the time result in offspring amounting to double the size of a population, evaluation extracts the best individuals among the offspring to match the number of individuals that should be present in a population. (For the purpose of this thesis, a population always has a fixed number of individuals.)

Termination Criteria
The process of selection, recombination, mutation and evaluation forms one 'generation' in the execution of a genetic algorithm. This process of evolving is repeated until the selected termination conditions are met. The new generation formed in the previous iteration is considered the current population in the next. The individual with the highest fitness in a generation is the optimal solution of that generation, and the solution to the problem is the fittest individual found over all the generations so far. There are three types of termination conditions which can be used in Genetic Algorithms:

• Quality of Solution
Meeting the expected optimal solution.

• Time
A certain number of generations have been created and tested.


• Diminishing Returns
The improvements made in each generation do not exceed a certain threshold.

In this thesis, the 'Time' termination condition will be used.

2.1.4

Greedy Algorithm

The Greedy Algorithm maintains a candidate set representing each component of the problem in review. It considers one candidate at a time and checks, based on its fitness score, whether it is feasible to add it to the solution (the subset). If the component is feasible it is added to the solution; otherwise it is removed from the candidate set altogether.
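As a sketch, the behaviour described above might look as follows in Java. The class name GreedyDemo and its interface are hypothetical, not the thesis implementation; fitness and cost are assumed to be parallel arrays, and feasibility is taken to mean fitting within a budget, as in the feature selection problem addressed later:

```java
import java.util.Arrays;

public class GreedyDemo {
    // Repeatedly take the fittest remaining candidate; keep it if it still
    // fits the budget, otherwise it is discarded from the candidate set.
    static boolean[] greedySelect(double[] fitness, double[] cost, double budget) {
        int n = fitness.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        // Visit candidate indices in descending order of fitness.
        Arrays.sort(order, (a, b) -> Double.compare(fitness[b], fitness[a]));
        boolean[] solution = new boolean[n];
        double spent = 0;
        for (int idx : order) {
            if (spent + cost[idx] <= budget) {   // feasible: add to the solution
                solution[idx] = true;
                spent += cost[idx];
            }                                    // infeasible: dropped for good
        }
        return solution;
    }
}
```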

2.2

Multi-Objective Aspect of the Optimisation Problem

The background knowledge provided above relates to the aspect of the optimisation problem which pays attention to a single objective in finding the optimal solution. However, realistic optimisation problems sometimes require the simultaneous optimisation of more than one objective. This involves solving an optimisation problem using more than one fitness function. One way of solving the multi-objective problem is to combine the multiple objectives into one scalar objective by weighting the objectives with a weight vector. This approach allows the use of any of the single objective optimisation algorithms mentioned above in order to solve the problem; however, the resulting solutions may then depend on the weight vector used in the weighting process. Another way of solving multi-objective optimisation problems is by finding the set of pareto optimal solutions to the problem. Instead of finding one global optimum, which is the general aim of solving an optimisation problem, this approach looks for a set of solutions, called a pareto optimal front, which are equally important and are all globally optimal solutions. Further details about this approach are described in the next section.

2.2.1

Pareto Optimisation

Pareto optimality is a measure of efficiency in multi-criteria and multi-party situations. The concept is widely applicable in solving multi-objective optimisation problems which consider two or more objectives (fitness functions) measured in different units [9]. Given solutions A and B, both solutions are pareto optimal given that they are both optimal, non-comparable, and no other solution exists that is better than them with respect to all objectives. In a bi-objective scenario, A and B are said to be non-comparable if Objective 1 of A is greater than Objective 1 of B while Objective 2 of A is less than Objective 2 of B, or vice versa. Such solutions are also called non-dominated solutions. If there exists a solution that performs worse with respect to all objectives when compared with A and B, that solution is considered a dominated solution. A set of pareto optimal solutions is called a pareto front. A pareto front can be visualised in a scatterplot of solutions where each objective/fitness function is graphed on a separate axis; see Figure 2.6. It is easier to visualise a pareto front for a bi-objective problem than for three or more objectives, as the two objectives to be maximised can be represented on the X and Y axes.
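The dominance relation described above can be written down directly. The following is a small illustrative Java helper (the class name ParetoDemo is hypothetical), assuming each solution's objective values are held in a parallel array and all objectives are maximised:

```java
public class ParetoDemo {
    // a dominates b if a is at least as good in every objective
    // (maximisation) and strictly better in at least one.
    static boolean dominates(double[] a, double[] b) {
        boolean strictlyBetter = false;
        for (int i = 0; i < a.length; i++) {
            if (a[i] < b[i]) return false;
            if (a[i] > b[i]) strictlyBetter = true;
        }
        return strictlyBetter;
    }

    // Two solutions are non-comparable (mutually non-dominated) when
    // neither dominates the other.
    static boolean nonComparable(double[] a, double[] b) {
        return !dominates(a, b) && !dominates(b, a);
    }
}
```

A pareto front is then exactly the set of solutions for which no other solution in the set dominates them.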

2.2.2

Multi-Objective Evolutionary Algorithms

The conventional optimisation techniques such as hill climbing, genetic algorithms, simulated annealing and the greedy algorithm may have difficulty extending to the multi-objective case, as they were not originally designed to find multiple solutions.


Figure 2.6: Pareto front for a bi-objective optimisation problem, consisting of equally important, globally optimal solutions.

However, it has been observed that genetic algorithms possess an interesting property which supports solving multi-objective optimisation problems when used together with the concept of pareto optimality [6]. The property of GAs which has contributed towards this success is that they evolve populations, and therefore have the ability to find multiple optima simultaneously. There exist various versions of multi-objective genetic algorithms; however, due to time limitations, this thesis discusses one algorithm, the Vector Evaluated Genetic Algorithm (VEGA) [5].

Vector Evaluated Genetic Algorithm (VEGA)
VEGA is the result of an extension made to the conventional genetic algorithm in order to include multiple objective functions. The extension was made by changing the selection operator of the conventional GA. The operator was modified in such a way that, if the number of individuals to be included in a generation is N and the number of objectives handled in the problem is k, then k sub-populations of size N/k are selected at each generation, using a different objective function each time. The sub-populations are then shuffled together in order to obtain a new population of size N, on which the conventional GA operators crossover and mutation are applied. Schaffer, who proposed this idea, considered the solutions generated at each generation to be non-dominated [5].
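The modified selection step can be sketched as follows. This is an illustrative simplification (the class name VegaSelectionDemo is hypothetical) that uses binary tournaments judged by one objective at a time in place of Schaffer's proportional selection, then shuffles the sub-populations together as described:

```java
import java.util.Random;

public class VegaSelectionDemo {
    // VEGA-style selection sketch: for each of the k objectives, select a
    // sub-population of size N/k using binary tournaments judged by that
    // objective alone, then shuffle the sub-populations together.
    // scores[i][j] = score of individual i under objective j.
    static int[] select(double[][] scores, int k, Random rnd) {
        int n = scores.length;          // population size N (assumed divisible by k)
        int[] mating = new int[n];
        int pos = 0;
        for (int obj = 0; obj < k; obj++) {
            for (int s = 0; s < n / k; s++) {
                int a = rnd.nextInt(n), b = rnd.nextInt(n);
                mating[pos++] = scores[a][obj] >= scores[b][obj] ? a : b;
            }
        }
        // Shuffle so crossover pairs mix individuals chosen under
        // different objectives (Fisher-Yates shuffle).
        for (int i = n - 1; i > 0; i--) {
            int j = rnd.nextInt(i + 1);
            int tmp = mating[i]; mating[i] = mating[j]; mating[j] = tmp;
        }
        return mating;
    }
}
```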

2.3

Chapter Summary

The chapter has explained the concept of Meta-Heuristic Search techniques and described the different types of search algorithms used in the rest of the thesis.

Chapter 3

Problem Specification of the Single-Objective Scenario

3.1

Mobile Phone as a Component-Based Software System

Motorola, which acted as a partner in this project, produces a range of portable communication devices (mobile phones) comprising various features which attract users and contribute towards the revenue-to-cost ratio of the company [3]. Each of these features possesses its own characteristics, such as cost of acquisition, customer desirability and expected revenue. The features are also prioritised (or ranked) by a set of experts based on their experience of the features. In order to work on the respective problem, we were provided with an anonymised data set from Motorola. The dataset contains 35 mobile phone features together with the fields cost, customer desirability, revenue and Expert Ranking, which are the characteristics of the 35 features given. The customer field refers to the priority given to a customer; its values are 1, 2, 3 and 4, where 1 represents the highest priority. The revenue field consists of the values 1, 2 and 3, where 1 represents the lowest revenue. The Expert Ranking field contains 1, 2, 3, . . . , 35, where 1 is the highest rank given. In solving the problem of evaluating an optimal subset of the available features for the next release, it is essential to associate a weight with each component/feature of the device, possibly by combining the customer desirability and the revenue brought by the particular feature [3]. Therefore, solving the problem means determining a particular subset which maximises the total sum of the weights while minimising the sum of the costs of the selected subset. Currently, the company undertakes this process by using the expert ranking as the weight associated with each component. Given the sample data in Table 3.1, and a budget of 300 to be covered, the features 1, 2 and 3 will be included in the subset rather than feature 4 alone, which also costs 300 but is ranked the lowest.
This is simply due to the fact that the process of determining an optimal subset should not only minimise the sum of the cost but also maximise the total sum of the weights.

Feature Name   Customer   Cost   Revenue   Expert Ranking
Feature1       3          100    3         1
Feature2       3          50     3         2
Feature3       1          150    2         3
Feature4       1          300    1         4

Table 3.1: Sample dataset for 4 features

3.2

Chapter Summary

The chapter has explained the specification of the problem in review.

Chapter 4

Design & Implementation Overview of the Single-Objective Scenario

4.1

Approach

The initial steps of approaching a search-based problem involve paying attention to the key factors of Representation and Fitness Function. Following is a description of the representation and fitness function adopted in order to solve the current problem.

4.1.1

Representation

The problem will be represented as a binary string, where a candidate solution (subset) consists of 35 features represented by a binary string of 35 digits. A 1 appearing as the nth digit represents the presence of the nth-ranked feature, and a 0 its absence.

4.1.2

Fitness Function

The fitness function which was chosen for the respective project is:

    fitness = revenue × (1 / customer desirability of a feature)

The fitness function was adopted from Harman et al. [3]. The reason behind the chosen fitness function is that the value which represents the revenue is directly proportional to the revenue of the feature, while the value representing the customer desirability is inversely proportional to the priority of the customer, as mentioned in Section 3.1.
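As a small illustration, the chosen fitness function might be coded as below. The class name FitnessDemo is hypothetical and this is not the thesis code; it simply applies the formula above, where customer priority 1 is the highest:

```java
public class FitnessDemo {
    // fitness = revenue * (1 / customer desirability), so a feature with
    // high revenue and high customer priority (priority 1 is the highest
    // priority) scores best.
    static double fitness(int revenue, int customerPriority) {
        return revenue * (1.0 / customerPriority);
    }
}
```

For example, a feature with revenue 3 and customer priority 1 scores 3.0, while revenue 2 with priority 4 scores only 0.5.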

4.2

System Design

The system is designed in such a way as to apply a specified algorithm to a set of features stored in a database, in order to obtain an optimal subset for a given budget. Figure 4.1 illustrates the component diagram of the system designed. The system obtains each of the feature details, such as customer desirability, revenue and cost, from the database. It will then calculate the fitness value for each feature and store the values in a separate fitness array. Once the initial requirements are met, the system will apply the specified algorithm and evaluate an optimal subset of features for a specified budget.

Figure 4.1: Component diagram for the optimisation problem

Field Name       Data Type   Description                                      Attributes
Feature Name     Text        Name of the Feature                              Primary Key
Customer         Text        Customer priority: ranges between 1-4
Revenue          Number      Revenue the feature brings: ranges between 1-3
Cost             Number      Cost of the feature
Expert Ranking   Number      Rank of the feature given by the Experts

Table 4.1: Database design

4.3

Database Design

The system uses the database to store the feature details. The database contains a table for the given set of 35 features. A similar table could be created in order to store another set of features to be implemented in another model of mobile phone. Shown in Table 4.1 is the design of the table contained in the database.

• Table Name: Features
• Purpose: This table contains 35 features and their given characteristics, to be utilised by the system in solving the optimisation problem.

4.4

Detailed Design and Implementation

This section explores the detailed design and implementation of the processes involved in the system. The system is implemented in Java and is optimised for fast execution.


Figure 4.2: System design flowchart

4.4.1

Data Handling

The system accommodates heuristic search algorithms such as variations of hill climbing, genetic algorithms, simulated annealing and the greedy algorithm, whose design decisions were supported by the facts discussed in Chapter 2. In order for these algorithms to be executed, the system needs to obtain the set of features in concern and calculate each feature's fitness score. Hence, the class DataHandler was implemented to manage the handling of data. The method featureHandler() in this class accesses the database, retrieves the revenue, cost and customer desirability of each feature, and stores them in three separate arrays called revenue, cost and customer respectively. The values are stored in the database in the order of the expert rank of the feature, from 1 to 35; thus the values are accessed from the database in this order, and the values related to a particular feature are stored in the mentioned arrays under the same index. Once the data is obtained, the method calculateFitness() takes the revenue and the customer desirability of each feature, calculates the fitness score of the feature using the chosen fitness function, and saves the score at the corresponding index of an array called fitness. The methods in the class DataHandler handle the initial steps which need to be completed before executing any heuristic algorithm on the data available.

4.4.2

Heuristic Search Algorithms

The system contains heuristic search algorithms including 2 variations of hill climbing, 9 variations of genetic algorithms, simulated annealing and the greedy algorithm. Each of these algorithms outputs the optimal solution for the problem in review as an array named solution of 35 binary digits; the value 1 in solution[i] indicates the presence of the feature ranked 'i' by the experts. The following sections discuss the design and implementation overview of the algorithms implemented.

Greedy Algorithm
The Greedy Algorithm was designed in such a way that all the features are first sorted in the descending order of their fitness score. In this manner the solution will contain the components with the highest scores until the budget bound has been reached. If the inclusion of a particular component would exceed the budget, that component is skipped:

for(i = 1 to number_of_components){
    solution[i] = 0;
}
for(i = 1 to number_of_components){
    if(current_cost + cost[i] <= budget){
        solution[i] = 1;
        current_cost = current_cost + cost[i];
    }
}

Steepest Ascent Hill Climbing
The class HillClimbing1 implements this technique. Its method getCurrent() generates a random initial candidate solution: it first selects all the features in the solution array, then repeatedly removes a randomly selected component from the set until the cost of the candidate solution fits within the budget. The class then calls the method setNeighbours(). This method generates all the neighbours of the initial solution; a neighbouring point is generated by flipping a single bit in the solution array, so since the solution has 35 bits it has 35 different neighbouring solutions. The method stores all these neighbours in a separate array, calculates the score of each neighbour, stores the scores in another separate array, and sorts both the neighbours and scores arrays in the descending order of the scores. The method buildClimber(), also implemented in this class, is responsible for returning the neighbour which has the highest score while also being within the budget. The search for the optimal solution continues until no fitter neighbour is found. Figure 4.4 illustrates the pseudocode of the main HillClimbing1() method, which calls all the methods described above:

if(score_of_currentSolution > bestScore){
    bestScore = score_of_currentSolution;
    stillToClimb = true;
}else
    stillToClimb = false;

Figure 4.4: Pseudocode of the method HillClimbing1() which implements Steepest Ascent Hill Climbing

Next Ascent Hill Climbing
This variation of hill climbing is designed to move to the first neighbour found to be fitter than the current solution and within the required budget. The process repeats until no fitter neighbour is found. The class which implements the respective technique is HillClimbing2. This class also includes a getCurrent() method in order to generate the initial solution.
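Both hill climbing variants explore the single-bit-flip neighbourhood described above. A minimal, self-contained Java sketch of the steepest ascent loop follows; the class name SteepestAscentDemo is hypothetical, fitness and cost are assumed to be parallel arrays, and the score of a subset is taken to be the sum of its members' fitness values (an assumption, not the thesis code):

```java
public class SteepestAscentDemo {
    // Steepest ascent hill climbing: at each step evaluate all single-bit-flip
    // neighbours, move to the best one that fits the budget, and stop when no
    // neighbour improves on the current score.
    static boolean[] climb(double[] fitness, double[] cost, double budget,
                           boolean[] start) {
        boolean[] current = start.clone();
        double bestScore = score(current, fitness);
        boolean stillToClimb = true;
        while (stillToClimb) {
            stillToClimb = false;
            int bestFlip = -1;
            double bestNeighbourScore = bestScore;
            for (int i = 0; i < current.length; i++) {
                current[i] = !current[i];                 // flip to form neighbour
                double c = totalCost(current, cost);
                double s = score(current, fitness);
                if (c <= budget && s > bestNeighbourScore) {
                    bestNeighbourScore = s;
                    bestFlip = i;
                }
                current[i] = !current[i];                 // flip back
            }
            if (bestFlip >= 0) {                          // move to best neighbour
                current[bestFlip] = !current[bestFlip];
                bestScore = bestNeighbourScore;
                stillToClimb = true;
            }
        }
        return current;
    }

    static double score(boolean[] sol, double[] f) {
        double s = 0;
        for (int i = 0; i < sol.length; i++) if (sol[i]) s += f[i];
        return s;
    }

    static double totalCost(boolean[] sol, double[] c) {
        double s = 0;
        for (int i = 0; i < sol.length; i++) if (sol[i]) s += c[i];
        return s;
    }
}
```

The next ascent variant differs only in that it moves to the first improving, within-budget neighbour found rather than the best one.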
A method called buildClimber() is implemented in order to process the search. This method generates a neighbour of the current solution by flipping a single bit in it; the process stops as soon as a neighbour is generated which has a higher fitness score and also fits within the budget. This continues until no fitter neighbour is found.

Simulated Annealing
Simulated Annealing is implemented by the class SimulatedAnnealing. The parameters needed to execute the designed algorithm are derived from Harman et al. [3]. The class implements a method SimulatedAnnealing() which is responsible for the definition of the neighbourhood and the specification of a cooling schedule, the two key ingredients of a well-performing simulated annealing algorithm.

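The acceptance decision and cooling schedule at the core of simulated annealing can be sketched in Java. The class name AnnealingDemo is hypothetical; the parameter names p1, p2 and zMax follow the text, and an improving neighbour is always accepted while a worsening one is accepted with a temperature-dependent probability:

```java
import java.util.Random;

public class AnnealingDemo {
    // One acceptance decision: always accept an improving neighbour; accept
    // a worsening one with probability e^(-delta / temperature), where
    // delta = currentScore - neighbourScore (>= 0 for a worsening move).
    static boolean accept(double currentScore, double neighbourScore,
                          double temperature, Random rnd) {
        if (neighbourScore >= currentScore) return true;
        double delta = currentScore - neighbourScore;
        return rnd.nextDouble() < Math.exp(-delta / temperature);
    }

    // Initial temperature as in the text, with p1 = 0.8 in the thesis setting.
    static double initialTemperature(double zMax, double p1) {
        return -(zMax / Math.log(1 - p1));   // log(1 - p1) < 0, so this is positive
    }

    // Geometric cooling schedule, with p2 = 0.2 in the thesis setting.
    static double cool(double temp, double p2) {
        return temp * (1 - p2);
    }
}
```

High temperatures make the acceptance probability for worsening moves close to 1, allowing escape from local optima; as the temperature cools the search behaves more like hill climbing.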

Generate an initial Solution;
initialTemp = -(zMax / log(1 - p1));
While initialTemp >= finalTemp {
    for(i = 1 to L){
        Generate a neighbouring solution;
        If (cost_of_neighbour <= budget){
            If (score_of_neighbour >= currentScore){
                currentScore = score_of_neighbour;
                If (currentScore >= bestScore)
                    bestScore = currentScore;
            }else{
                Generate a random probability r;
                If (r < e^(-(currentScore - score_of_neighbour) / initialTemp)){
                    currentScore = score_of_neighbour;
                    if (currentScore > bestScore)
                        bestScore = currentScore;
                }else
                    Go back to the previous solution;
            }
        }else
            Go back to the previous solution;
    }
    initialTemp = initialTemp * (1 - p2);
}

Figure 4.5: Pseudocode of Simulated Annealing

The method generates an initial solution randomly, and generates a neighbour of the initial solution by flipping a random bit in it. If the neighbour generated has a higher fitness, the search moves to this point; otherwise the transition rules are considered in order to continue the search. Figure 4.5 illustrates the pseudocode for the method SimulatedAnnealing(). The parameter setting for the algorithm, as derived from Harman et al. [3], is as follows: p1 = 0.8, p2 = 0.2, finalTemp = 0.005, zMax = the total score of all the features, and L = 15000, where L denotes the number of iterations at each level of temperature. As can be seen in the pseudocode, the initial temperature is

    initialTemp = -(zMax / log(1 - p1))

and the cooling schedule is initialTemp = initialTemp * (1 - p2).

Genetic Algorithms (GA)
There are 9 variations of genetic algorithms designed and implemented, using 3 types of selection and 3 types of crossover techniques. The algorithms are implemented using two classes, namely GAManager and GAHandler, where GAManager is responsible for implementing all the methods needed to generate a genetic algorithm while GAHandler is responsible for calling the methods in GAManager depending on the variation of the algorithm requested.


t = 0;
Generate initial population P(t) at random;
Evaluate each individual in P(t);
While termination condition not satisfied do
Begin
    t = t + 1;
    Select some parents S(t) from P(t-1);
    Generate offspring Q(t) by applying crossover and mutation on S(t);
    Evaluate offspring Q(t);
End;

Figure 4.6: Skeleton of the GA designed

Generate a random value r between 0 and 1;
sum = 0;
for(i = 1 to N){
    sum = sum + p_i;
    if(sum >= r)
        return i;
}

Figure 4.7: Pseudocode of the Roulette Wheel Selection Method

The skeleton of the GA designed is shown in Figure 4.6. The class GAHandler is derived from this skeleton and is implemented in the method geneticAlgorithm(). The termination criterion is simply that the algorithm terminates after the completion of 200 iterations (generations). The best answer generated over the iterations is saved and output as the optimal solution for a specified budget. In order for a specific variation of a genetic algorithm to be requested, GAHandler.geneticAlgorithm(selection type, crossover type) needs to be called, passing the required selection and crossover type. Thereby, the method will call the corresponding selection and crossover methods and the other GA operators implemented in GAManager. The method getInitialPopulation() generates a population of candidate solutions; the number of individuals contained in any population created in the process is fixed at 500, so the method randomly generates 500 initial solutions. GAManager contains methods which implement all the selection, crossover and mutation techniques designed. There are three variations of selection implemented, namely the roulette wheel, stochastic universal sampling and tournament selection methods, while there are three types of crossover available, namely the one point, two point and uniform crossover methods. The selection methods are used to select a mating pool out of the initial population generated.
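The roulette wheel pseudocode of Figure 4.7 can be fleshed out as a small Java sketch. The class name RouletteDemo is hypothetical and this is not the GAManager implementation; fitness values are normalised so each individual occupies a segment of [0, 1) proportional to its fitness:

```java
import java.util.Random;

public class RouletteDemo {
    // Fitness-proportionate selection: spin once and return the index of the
    // individual whose segment of [0, 1) contains the random value r.
    static int spin(double[] fitness, Random rnd) {
        double total = 0;
        for (double f : fitness) total += f;
        double r = rnd.nextDouble();          // random value in [0, 1)
        double sum = 0;
        for (int i = 0; i < fitness.length; i++) {
            sum += fitness[i] / total;        // p_i = f_i / (f_1 + ... + f_n)
            if (sum >= r) return i;
        }
        return fitness.length - 1;            // guard against rounding error
    }
}
```

Selecting a mating pool is then a matter of spinning repeatedly, once per individual required.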
The method tournamentSelection() selects two random individuals from the current population, compares their fitness and selects the better of the two, storing it in an intermediate population array. This process continues until 500 individuals have been selected for the intermediate population.

The method rouletteWheelSelection() implements the roulette wheel selection method. The pseudocode which describes the process taking place in this method is shown in Figure 4.7; Figure 2.3 also illustrates this operation. Suppose the fitness values of the N individuals in the population are f1, f2, f3, . . . , fn; the probability of individual i being selected is:

    p_i = f_i / (f1 + f2 + f3 + . . . + fn)

The process in the pseudocode will be repeated as many times as the number of individuals needed for the intermediate population; in our case this will be done 500 times.

Pointer += pointSpace;
sum = 0;
for(i = 1 to N){
    sum = sum + p_i;
    if(sum >= pointer)
        return i;
}

Figure 4.8: Pseudocode of the Stochastic Universal Sampler Selection Method

Figure 4.8 illustrates the pseudocode which explains the process undertaken in the method stochasticUniversalSampler(); Figure 2.4 also illustrates this operation. Even in this case, the probability of individual i being selected is p_i = f_i / (f1 + f2 + f3 + . . . + fn), given that the fitness values of the N individuals in the population are f1, f2, f3, . . . , fn. The process described in the pseudocode will be repeated as many times as the number of individuals needed for the intermediate population, which is 500. This in turn means that the number of equally spaced pointers created in order to process the stochastic universal sampling technique is 500, making pointSpace, the space between any two pointers, 1/500.

The method onePointCrossover() selects each pair of parents from the mating pool and generates a random point at which the two parent strings are split, e.g. 11010 * 01100101101 and yxyyx * yxxyyyxyxxy. It then recombines the two parents by swapping the fragments between them, producing the two offspring 11010yxxyyyxyxxy and yxyyx01100101101. Likewise, the twoPointCrossover() method selects each pair of parents in the intermediate population and generates two random points at which they are split, e.g. 11010 * 01100 * 101101 and yxyyx * yxxyy * yxyxxy; swapping the fragments between the two parents produces the offspring 11010yxxyy101101 and yxyyx01100yxyxxy. The method uniformCrossover(), on the other hand, generates a random crossover mask, a 35-digit binary string, for each pair of parent individuals in the mating pool. If the ith bit of the mask is 1, the ith bit of the offspring is taken from parent 1, whereas if the ith bit of the mask is 0, the ith bit of the offspring is copied from the corresponding bit of parent 2.

In all three crossover methods explained above, there exists a common functionality which takes place once the offspring are produced: checking whether the generated offspring are within the specified budget. If an offspring does not fit within the budget, it is replaced by its parents.
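The one-point variant described above might be sketched as follows in Java. The class name OnePointCrossoverDemo is hypothetical; the crossover point is taken as a parameter here so the example is deterministic, whereas the thesis code draws it at random:

```java
import java.util.Arrays;

public class OnePointCrossoverDemo {
    // Splits both parents at the same point and swaps the tail fragments,
    // producing two offspring (e.g. 11010|011... and yxyyx|yxx... recombine
    // into 11010yxx... and yxyyx011...).
    static int[][] onePointCrossover(int[] p1, int[] p2, int point) {
        int[] c1 = Arrays.copyOf(p1, p1.length);
        int[] c2 = Arrays.copyOf(p2, p2.length);
        for (int i = point; i < p1.length; i++) {   // swap the tail fragments
            c1[i] = p2[i];
            c2[i] = p1[i];
        }
        return new int[][]{c1, c2};
    }
}
```

Two-point crossover follows the same pattern with the swap restricted to the middle segment between the two chosen points.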
The class GAManager also implements the method setMutation() to apply the mutation operator to the generated offspring. For each offspring, the method generates a random probability and compares it to the default probability, which is 0.6. If the generated value is greater than the default, it mutates the particular offspring; otherwise it moves on to the next offspring. The method also makes sure that the mutated offspring is within the budget. setReplacement() is another method implemented in the class GAManager. This method sorts the offspring in the descending order of their fitness scores and selects only the best 500 individuals to be put in the next generation; this is because each population generated in the algorithm should have a fixed size, which is 500 in the case of this particular GA design. Finally, the setOptimal() method returns the offspring with the best score, which in turn is the optimal solution of the current generation. This solution is compared with the best solution so far by the method geneticAlgorithm() in GAHandler, and saved if it is better than the best result so far. Table 4.2 lists the combinations of selection and crossover types which resulted in the 9 variations of GAs implemented in this thesis.

GA Type   Selection Type                  Crossover Type
GA1       Roulette Wheel                  One point
GA2       Roulette Wheel                  Two point
GA3       Roulette Wheel                  Uniform
GA4       Stochastic Universal Sampling   One point
GA5       Stochastic Universal Sampling   Two point
GA6       Stochastic Universal Sampling   Uniform
GA7       Tournament Selection            One point
GA8       Tournament Selection            Two point
GA9       Tournament Selection            Uniform

Table 4.2: 9 variations of GAs implemented and their selection and crossover types
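The mutation step described above (a random draw compared against the 0.6 default, followed by a single random gene flip) can be sketched as below. The class name MutationDemo is hypothetical, and the budget check performed by the real setMutation() is omitted for brevity:

```java
import java.util.Random;

public class MutationDemo {
    // Bit-flip mutation: draw a probability for the offspring and, if it
    // exceeds the default threshold (0.6 in the thesis setting), flip one
    // randomly chosen gene; otherwise return the offspring unchanged.
    static int[] mutate(int[] offspring, double threshold, Random rnd) {
        int[] result = offspring.clone();
        if (rnd.nextDouble() > threshold) {
            int gene = rnd.nextInt(result.length);
            result[gene] = 1 - result[gene];      // flip the chosen 0/1 bit
        }
        return result;
    }
}
```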

4.4.3

Specification of the Algorithm Required

The system includes the class SearchManager, which calls the methods DataHandler.featureHandler() and DataHandler.calculateFitness() that perform the initial steps necessary to execute any algorithm. Once these methods have been called, the required algorithm can be invoked in this class. For instance, if the iterative next ascent hill climbing needs to be executed on the data, the following call could be made:

new HillClimbing3.Hillclimbing3("Next Ascent")

4.4.4

Solution Handling

During the execution of an algorithm there are many instances where the score of a particular solution must be calculated. Once an algorithm finishes execution, it also needs to obtain the score, size and cost of the optimal subset generated. The SolutionHandler class implements methods which facilitate these needs, bringing better software engineering practice to the system.

4.5

Chapter Summary

The chapter has explored the approaches which need to be taken in order to solve the current problem in review. It has also given a detailed overview of the system and database design, and of the main components of the system.

Chapter 5

Evaluation The evaluation of the heuristic search techniques discussed in the thesis was carried out in 3 different ways which involved the analysis of performance, effort and robustness of the techniques. All three of these measures are equally significant in the process of evaluating the techniques implemented, as they examine different aspects of them. Analysis of performance as the name suggests explores the algorithms in order to see which one of them performs the best by generating the most optimal solution or rather the solution which is the closest to the optimal solution. On the other hand, effort analysis examines the algorithms to see which algorithm takes up more cost/time in generating the results. Analysis of robustness it self has two aspects. One is to analyse the robustness of the results produced by the algorithms while the other is to analyse the robustness of the speed of the algorithms in producing the results. This analysis evaluates how consistent the algorithms are, in terms of the results they produce and the speed at which they generate them despite the inherent random nature of them. Each of the above mentioned measures of analysis become quite important as they all contribute towards evaluating the techniques implemented, in different ways. It is hard to conclude which one of them is better than the other in evaluating the algorithms. However, one could matter more than the other depending on the situation of the problem in review. For instance, if generating the “almost optimal” solution is what matters the most in solving the optimisation problem, the performance analysis will be more important where as effort analysis will be more important if having less costly ways of generating optimal solutions matters a lot. However, there could also be scenarios where it is important to generate robust solutions rather than generating the optimal solution. 
In that case, an algorithm of slightly lower performance but greater robustness would be preferred. In this way, each of the measures has its own significance in evaluating the techniques implemented. The following sections describe the experimental methodology undertaken, the related results and the analysis carried out with respect to each of the above measures. The general experimental methodology for the evaluation involved subjecting the 35 features to a series of 35 feature subset selection experiments for each technique implemented. A budget bi was allocated to each experiment Ei, for which a subset of features was to be selected by applying a particular automated algorithm. Each budget bi was the budget implicitly required by the expert ranking method to include the first i features in a subset, given that only i features were required to be included in a particular product.
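For concreteness, the derivation of the budget bounds can be sketched as below, under the assumption (implied by the text) that bi is the cumulative cost of the first i expert-ranked features; the costs used here are illustrative, not the Motorola data.

```java
// Sketch: deriving the budget bound b_i for experiment E_i from the expert
// ranking, assuming b_i is the cumulative cost of the first i ranked features.
public class BudgetBounds {
    public static int[] budgets(int[] rankedCosts) {
        int[] b = new int[rankedCosts.length];
        int sum = 0;
        for (int i = 0; i < rankedCosts.length; i++) {
            sum += rankedCosts[i];  // cost of the i-th expert-ranked feature
            b[i] = sum;             // budget that admits exactly the first i+1 features
        }
        return b;
    }

    public static void main(String[] args) {
        int[] costs = {100, 250, 400};  // hypothetical feature costs
        System.out.println(java.util.Arrays.toString(budgets(costs))); // prints [100, 350, 750]
    }
}
```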


Figure 5.1: Performance analysis of Hill Climbing for all budget bounds which illustrates the scores achieved by each variation for each budget bound.

5.1 Results and Analysis

In order to mitigate the effects of the algorithms' inherent randomness, each feature subset selection experiment was executed 10 times for each algorithm, and the average values of the results obtained over the iterations, such as the overall score and the number of features included in the resulting subset, were recorded. The results and their analysis are presented in this section.

5.1.1 Performance Analysis

The performance analysis involves evaluating the results generated by each of the algorithms.

Experimental Methodology

The 35 feature subset selection experiments were carried out for each algorithm, recording the scores and the sizes of the resulting subsets, along with the time taken to generate them, at each experiment.

Results and Performance Analysis

The entire set of experiments to produce the feature subsets using all the approaches took practically no noticeable time on standard computing equipment. Figures 5.1 and 5.2 illustrate the performance of the Hill Climbing and Genetic Algorithm variations, respectively, for all thirty-five budget bounds. As can clearly be seen, Steepest Ascent Hill Climbing (SAHC) performs better than the Next Ascent Hill Climbing (NAHC) technique. According to Figure 5.2, all 9 variations of the GA perform quite closely, with GA9 performing the best of the 9.


Figure 5.2: Performance analysis of Genetic Algorithms for all budget bounds which illustrates the scores achieved by each variation for each budget bound.

Figure 5.3: Performance analysis of all the algorithms for all budget bounds which illustrates the scores achieved by each method for each budget bound.


Figure 5.4: Performance analysis of Hill Climbing over the filtered budgets within the range 1740 - 3620, which illustrates the scores achieved by each variation for each budget bound in the range.

The graph presented in Figure 5.3 therefore represents the overall scores of the Expert Ranking, Greedy Algorithm, Steepest Ascent Hill Climbing (SAHC), Simulated Annealing and GA9 methods for all 35 budget bounds. As expected, the graphs of all these performance functions coincide at the last budget bound, as that is the budget which allows the inclusion of all the features in the subset. It can also be seen that all the methods implemented perform considerably better than the expert ranking. However, the performance of the implemented techniques is almost identical for higher budget values, while it is difficult to compare at smaller values. The reason could be that the higher the budget value, the more components can be included in a subset irrespective of the algorithm being used. Hence the above graphs were regenerated for a filtered set of budget bounds which gives a clearer view of the performance of the algorithms. Figure 5.4 presents the performance of the two variations of Hill Climbing for the filtered budget bounds from 1740 to 3620. Clearly, Steepest Ascent Hill Climbing (SAHC) performs better than Next Ascent. The reason could be that SAHC evaluates all possible neighbours of the current point and then moves to the best of them, unlike NAHC, which moves to the first neighbour it finds that is fitter than the current point. In this way SAHC climbs towards the optimal solution, always picking the best possible solution it can find on its way. Figure 5.5 illustrates the performance analysis of the 9 variations of genetic algorithms for the same filtered budget range.
According to the graph it is hard to conclude which variation of the GA is better than another; however, it is quite clear that GA9 performs the best of all. It could perform this way due to the combination of the selection and crossover techniques it uses. Of the three selection types, tournament selection could perform better at selecting fitter parents, as it literally selects the better parent of each randomly chosen pair, whereas the other two selection methods select parents depending on the probability a parent


Figure 5.5: Performance analysis of Genetic Algorithms over the filtered budgets within the range 1740 - 3620, which illustrates the scores achieved by each variation for each budget bound in the range.

component holds of being selected. In this way, tournament selection always makes the mating pool fitter, so that offspring share the fitter features of both parents. Moreover, GA9 also uses the uniform crossover technique, which can be understood to result in a fitter and more robust search. Suppose two similar parent genes were subjected to reproduction: both one-point and two-point crossover would produce offspring identical to the parents, which is far less likely to happen with uniform crossover. Thus, uniform crossover can explore a vaster search space. The graph shown in Figure 5.6 represents the performance analysis of the Expert Ranking, Greedy Algorithm, Simulated Annealing, Steepest Ascent Hill Climbing (SAHC) and GA9 methods over the same filtered budget range discussed earlier. The initial observation is that Simulated Annealing produces the best score, with SAHC, GA9 and the Greedy Algorithm following respectively, and expert ranking performing least well. The patterns of the scores gained by the techniques can be explained by observing the number of components each solution subset contains. Figure 5.7 shows the number of components each solution contained for all the budget bounds, while Figure 5.8 shows the number of components each solution included for the previously mentioned filtered budget bounds. As can be seen from both graphs, the number of components in the solutions generated by expert ranking increased by one with each budget increase, while the other methods were able to include many more components at each experiment, hence generating higher fitness scores.
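The two operators that distinguish GA9, tournament selection and uniform crossover, can be sketched as below. This is an assumed, simplified form; the thesis's own classes (GAManager and friends) are only described, not shown, and the fitness values here are illustrative.

```java
import java.util.Random;

// Sketch of GA9's operators: binary tournament selection and uniform
// crossover producing a single offspring per pair of parents.
public class Ga9Operators {
    static final Random RNG = new Random(42); // seeded for repeatability

    // Tournament selection: pick two individuals at random, keep the fitter.
    public static int tournament(double[] fitness) {
        int a = RNG.nextInt(fitness.length), b = RNG.nextInt(fitness.length);
        return fitness[a] >= fitness[b] ? a : b;
    }

    // Uniform crossover: each gene is copied from either parent with equal
    // probability, yielding one offspring from two parents.
    public static boolean[] uniformCrossover(boolean[] p1, boolean[] p2) {
        boolean[] child = new boolean[p1.length];
        for (int i = 0; i < child.length; i++)
            child[i] = RNG.nextBoolean() ? p1[i] : p2[i];
        return child;
    }

    public static void main(String[] args) {
        double[] fitness = {3.0, 9.0, 1.0};
        System.out.println("winner index: " + tournament(fitness));
        boolean[] child = uniformCrossover(new boolean[]{true, true, true},
                                           new boolean[]{false, false, false});
        System.out.println(java.util.Arrays.toString(child));
    }
}
```

Note that, as the text observes, crossing two identical parents with uniform crossover still yields a copy of the parents; diversity arises only when the parents differ.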
The reason Simulated Annealing performs best could be its ability to consider a vast search space, in particular its ability to move downhill on its way towards the optimal solution, as mentioned in Section 2.1.2. Moreover, SAHC appears to deal with the second largest search space, ahead of GA9, for the problem in review. This is probably because SAHC evaluates the entire set of neighbours of each current point, and climbs many hills in order to decide on a global optimum.
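The contrast drawn in this section between SAHC's full neighbourhood sweep and NAHC's first-improvement move can be sketched as below. The representation (a boolean inclusion vector with one-bit-flip neighbours) follows the thesis; the scores, costs and budget are illustrative assumptions.

```java
import java.util.Arrays;

// Sketch of Steepest Ascent vs Next Ascent Hill Climbing on the feature
// subset problem. Fitness is the summed score of the included features,
// with over-budget subsets rejected (fitness -1).
public class HillClimbers {
    public static double fitness(boolean[] s, double[] score, double[] cost, double budget) {
        double f = 0, c = 0;
        for (int i = 0; i < s.length; i++) if (s[i]) { f += score[i]; c += cost[i]; }
        return c <= budget ? f : -1; // reject over-budget subsets
    }

    // SAHC: evaluate every neighbour, move to the best improving one.
    public static boolean[] sahc(boolean[] s, double[] score, double[] cost, double budget) {
        while (true) {
            double best = fitness(s, score, cost, budget);
            int bestBit = -1;
            for (int i = 0; i < s.length; i++) {
                s[i] = !s[i];
                double f = fitness(s, score, cost, budget);
                s[i] = !s[i];
                if (f > best) { best = f; bestBit = i; }
            }
            if (bestBit < 0) return s;       // local optimum reached
            s[bestBit] = !s[bestBit];
        }
    }

    // NAHC: move to the first neighbour found that is fitter than the current point.
    public static boolean[] nahc(boolean[] s, double[] score, double[] cost, double budget) {
        boolean improved = true;
        while (improved) {
            improved = false;
            double cur = fitness(s, score, cost, budget);
            for (int i = 0; i < s.length && !improved; i++) {
                s[i] = !s[i];
                if (fitness(s, score, cost, budget) > cur) improved = true;
                else s[i] = !s[i];           // revert a non-improving flip
            }
        }
        return s;
    }

    public static void main(String[] args) {
        double[] score = {6, 5, 4, 3}, cost = {4, 3, 2, 1};
        boolean[] s = sahc(new boolean[4], score, cost, 6);
        // prints [true, false, true, false] fitness=10.0
        System.out.println(Arrays.toString(s) + " fitness=" + fitness(s, score, cost, 6));
    }
}
```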


Figure 5.6: Performance analysis of all the algorithms for the filtered budget bounds which illustrates the scores achieved by each method for each budget bound in the range.

Figure 5.7: Performance analysis of all the algorithms for all budget bounds which illustrates the number of components selected by each method for each budget bound.


Figure 5.8: Performance analysis of all the algorithms for the filtered budget bounds which illustrates the number of components selected by each method for each budget bound in the range.

It is also evident why all the other techniques implemented outperform the Greedy Algorithm. The algorithm does not even follow the intuition that higher budgets should yield higher scores, as can clearly be observed in the solutions produced for budgets 3060 and 3070 (refer to Figures 5.3 and 5.7). These results are caused by the way the algorithm works. It starts from the highest scored component and considers each in turn, progressing towards the lowest scored component, until no more unused components can be added without exceeding the budget. If the cost of a component does not allow it to be included, the algorithm skips that component and continues with the next in line. For instance, at a budget bi, if adding component cx would cause the subset to exceed the budget, that component is ignored and components cx+1 and cx+2 are added instead, provided cost(cx) > cost(cx+1) + cost(cx+2). However, at the next budget bi+1 the component cx could be included, as the budget has more room for a component with a higher cost; the addition of cx+1 and cx+2 may then no longer be possible, as there might not be any budget room left for further components. If score(cx) < score(cx+1) + score(cx+2), a drop in score between the two budgets is observed. All in all, the search space of the problem in review suits all the implemented algorithms except the greedy algorithm. Nonetheless, all the techniques, including the greedy algorithm, still surpass the performance of the expert judgement approach, generating solution subsets with much higher scores and numbers of components.
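The greedy baseline and the score drop just described can be reproduced on a tiny constructed dataset (not the Motorola data): at one budget the top item is skipped and two cheaper items are taken, while at the next budget the top item displaces both, lowering the score.

```java
// Sketch of the greedy baseline: consider features in descending score
// order, adding each one whose cost still fits the remaining budget.
public class GreedySubset {
    public static double greedy(double[] score, double[] cost, double budget) {
        // assumes arrays are already sorted by descending score
        double total = 0, spent = 0;
        for (int i = 0; i < score.length; i++) {
            if (spent + cost[i] <= budget) { spent += cost[i]; total += score[i]; }
            // otherwise skip this feature and continue with the next in line
        }
        return total;
    }

    public static void main(String[] args) {
        double[] score = {10, 7, 6};
        double[] cost  = {6, 3, 2};
        System.out.println(greedy(score, cost, 5)); // prints 13.0: skips the top item, takes the other two
        System.out.println(greedy(score, cost, 6)); // prints 10.0: the top item displaces both, a lower score
    }
}
```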

5.1.2 Effort Analysis

The effort analysis involves evaluating the effort made by each of the algorithms, and the way they react when forced to restrict the effort they make in generating solutions.


Experimental Methodology

An effort counter was fixed in each algorithm so that it increments each time the algorithm evaluates a candidate solution, that is, each time the algorithm calculates the score of a candidate solution. For instance, the counters increase when a neighbour is evaluated in hill climbing and simulated annealing, and when a new generation is generated and evaluated in the GAs. Note that the algorithms subjected to the entire set of experiments were SAHC, NAHC, Simulated Annealing and all variations of the GA. The greedy algorithm was excluded because its effort is fairly small and constant at each run: just 35 evaluations for the 35 features. The 35 feature subset selection experiments were carried out for each algorithm, recording the scores of the resulting subsets along with the effort taken to generate them at each experiment. Once the algorithm with the lowest effort was found, the experiments were re-run with a stopping criterion of reaching this particular effort limit. For instance, if the lowest effort made by any algorithm at a certain budget bound bi is ei, then all the methods were re-run until they reached the effort ei in generating solutions for budget bi. The scores of the resulting subsets were recorded at each experiment.

Results and Effort Analysis

Once the above experiments were carried out, the following results were generated:

• the effort made by each algorithm in generating solutions for each budget;

• the scores of the solutions generated by each algorithm for each budget, before and after restricting their efforts to the low effort limit.

Table 5.1 presents the graphs generated from the results for a selected number of budgets: 100, 3060, 3600, 5300 and 5910.
The first graph in each line presents the effort made by each method to generate the solution for the given budget, while the second graph in the line illustrates the change in scores of the solutions generated under restricted effort. White bars represent the scores generated before the efforts were restricted, while black bars represent the scores after the effort restriction. The efforts made by SAHC and NAHC were negligible, so their values cannot even be seen in the graphs. Nonetheless, the results show that SAHC made more effort than NAHC; the reason is obvious, as SAHC evaluates the entire set of neighbours of the current point, unlike NAHC. The efforts made by the genetic algorithms which use the uniform crossover technique appear identical to one another, but lower than those of the remaining variations, which in turn made an identical effort among themselves. There is a reason behind these results too. The variations using uniform crossover make a smaller effort because they deal with half the number of offspring the other variations do: as mentioned in Section 2.1.3, uniform crossover produces one offspring from 2 parents, as opposed to the other crossover techniques, which produce 2 offspring from the combination of 2 parents. Therefore, GAs using uniform crossover handle only half the amount of search space the other variations handle. If the GA variations are categorised into those using uniform crossover and the rest, the variations in each category make the same effort, as the search spaces they handle are of the same size, the GAs implemented using a constant number of individuals per generation. The behaviour of the techniques when given a lower-bound termination criterion was also straightforward.
As each experiment resulted in NAHC making the least effort,


Table 5.1: Effort analysis for the methods SAHC, NAHC, SA, GA1, GA2, GA3, GA4, GA5, GA6, GA7, GA8 and GA9 at selected budgets. Lines 1 to 5 illustrate the effort analysis carried out at budgets 100, 3060, 3600, 5300 and 5910 respectively. The first graph in each line presents the effort made by each method to generate the solution for the given budget; the second graph in the line illustrates the change in scores of the solutions generated under restricted effort. White bars represent the scores generated before the efforts were restricted, while black bars represent the scores after the effort restriction.


the experiment was redone with the termination criterion set to the effort made by NAHC for the corresponding budget bound. The graphs clearly show that the scores achieved by the techniques reduced when they were forced to make lower efforts than before.
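The effort bookkeeping described in the methodology above might look like the following sketch. The names are assumptions; the thesis embeds its counters directly in each algorithm, whereas here the counter and the optional effort-limit stopping criterion are gathered in one small class.

```java
// Sketch of an effort counter: every fitness evaluation increments the
// counter, and an optional limit acts as the stopping criterion used in
// the restricted-effort re-runs.
public class EffortMeter {
    public long effort = 0;
    public final long limit;           // Long.MAX_VALUE means "no restriction"

    public EffortMeter(long limit) { this.limit = limit; }

    public double evaluate(boolean[] s, double[] score) {
        effort++;                      // one unit of effort per candidate evaluated
        double f = 0;
        for (int i = 0; i < s.length; i++) if (s[i]) f += score[i];
        return f;
    }

    public boolean exhausted() { return effort >= limit; }

    public static void main(String[] args) {
        EffortMeter m = new EffortMeter(3);
        double[] score = {1, 2, 4};
        while (!m.exhausted())
            m.evaluate(new boolean[]{true, false, true}, score);
        System.out.println("evaluations used: " + m.effort); // prints evaluations used: 3
    }
}
```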

5.1.3 Robustness Analysis

The robustness analysis involves evaluating the consistency of each of the algorithms implemented, both in generating the optimal solutions and in the speed at which they generate them.

Experimental Methodology

The 35 feature subset selection experiments were carried out 10 times for each algorithm, recording the scores of the solutions produced at each run. To get a clear view of the variation of the results produced by each technique across these iterations, the values were normalised: each result x, for algorithm a and budget b, was recorded as the percentage difference from the mean of all results for budget b and algorithm a. Hence, if a result is 15 and the mean of all the results is 20, its normalised value is −25%; if the result is 30, the normalised value is 50%. The same experiments were repeated recording the normalised values of the effort taken by the algorithms. The experiments were not carried out for the greedy algorithm, as it is clearly evident that the results it produces, in terms of both the scores of the solutions generated and the effort taken, will always be consistent.

Results and Robustness Analysis

Table 5.2 illustrates the 12 boxplots generated for each algorithm, representing the variance of the scores generated when the 35 feature subset selection experiments were applied over 10 iterations. Table 5.3, which illustrates the variation of the results about their associated means, provides an even better view of this variance. Similarly, Tables 5.4 and 5.5 show the boxplots portraying the variation of the efforts taken and their percentage differences, respectively.
The initial observations of the graphs in Tables 5.2 and 5.3 were that the solutions produced by simulated annealing were very consistent, the GA variations produced relatively small variations in their results, and the hill climbing variations produced the largest variance. The graphs suggest that simulated annealing generates very steady solutions, proving to be the most robust technique implemented, while the GA variations also prove quite robust. The hill climbers, however, generated unstable results across the 10 iterations, which could be due to the random manner in which they encounter better solutions in their climb towards the global optimum. The graphs portrayed in Tables 5.4 and 5.5 suggest that the efforts made by all GA variations are consistent; the efforts made by simulated annealing show smaller variance, while those of the hill climbers show higher variance. The robustness of the GAs in terms of their speed in generating solutions appears very high, perhaps because they operate on populations containing a uniform number of individuals. The hill climbers appear less robust even in terms of speed, once again due to the random manner in which they discover better solutions while climbing towards the global maximum.
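The normalisation used throughout this analysis can be sketched as below, reproducing the worked example from the methodology (a result of 15 against a mean of 20 gives −25%, and 30 gives 50%).

```java
// Sketch of the robustness normalisation: each result is recorded as its
// percentage difference from the mean of all results for that
// algorithm/budget pair.
public class Normalise {
    public static double[] percentFromMean(double[] results) {
        double mean = 0;
        for (double r : results) mean += r;
        mean /= results.length;
        double[] out = new double[results.length];
        for (int i = 0; i < results.length; i++)
            out[i] = (results[i] - mean) / mean * 100.0;
        return out;
    }

    public static void main(String[] args) {
        double[] n = percentFromMean(new double[]{15, 30, 15});
        System.out.println(java.util.Arrays.toString(n)); // prints [-25.0, 50.0, -25.0]
    }
}
```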


Table 5.2: From top to bottom and left to right are the boxplots generated for the algorithms SAHC, NAHC, SA, GA1, GA2, GA3, GA4, GA5, GA6, GA7, GA8 and GA9 respectively. The boxplots represent the variation of the scores generated by each of them when the 35 feature subset selection experiments were applied over 10 iterations. The X-axis of each boxplot represents the budgets used in the experiments, while the Y-axis represents the scores of the solutions generated at each experiment.


Table 5.3: From top to bottom and left to right are the boxplots generated for the algorithms SAHC, NAHC, SA, GA1, GA2, GA3, GA4, GA5, GA6, GA7, GA8 and GA9 respectively. These boxplots provide a better view of the variance of the scores generated by each of them when the 35 feature subset selection experiments were applied over 10 iterations. The X-axis of each boxplot represents the budgets used in the experiments, while the Y-axis represents the variation of the scores of the solutions generated at each experiment about their associated means.


Table 5.4: From top to bottom and left to right are the boxplots generated for the algorithms SAHC, NAHC, SA, GA1, GA2, GA3, GA4, GA5, GA6, GA7, GA8 and GA9 respectively. The boxplots represent the variation of the efforts taken by each of them to generate solutions when the 35 feature subset selection experiments were applied over 10 iterations. The X-axis of each boxplot represents the budgets used in the experiments, while the Y-axis represents the effort taken to generate the solutions at each experiment.


Table 5.5: From top to bottom and left to right are the boxplots generated for the algorithms SAHC, NAHC, SA, GA1, GA2, GA3, GA4, GA5, GA6, GA7, GA8 and GA9 respectively. These boxplots provide a better view of the variation of the efforts taken by each of them to generate solutions when the 35 feature subset selection experiments were applied over 10 iterations. The X-axis of each boxplot represents the budgets used in the experiments, while the Y-axis represents the variation of the efforts taken to generate the solutions at each experiment about their associated means.

5.2 Chapter Summary

The chapter has looked at different ways of evaluating the techniques implemented, and their significance. It has evaluated the techniques in terms of these measures: performance, effort taken, and robustness in terms of both the results generated and the effort made in generating the solutions.

Chapter 6

Extension from Single-Objective to Multi-Objective Scenario

The thesis has so far addressed the optimisation problem in a single-objective scenario. It now focuses on extending it from the single-objective aspect to the multi-objective aspect of the optimisation problem.

6.1 Problem Specification of the Multi-Objective Scenario

In addition to the specification made in Chapter 3, Motorola also stated that each feature/component of a mobile phone occupies a particular amount of memory in the phone. This memory size, also called the footprint size, differs from one feature to another. The company states that, for economic reasons, it prefers the amount of memory occupied by a feature subset to be as small as possible. An optimal feature subset should therefore result in the least usage of memory in the mobile phone; in other words, it should leave the maximum capacity of memory unused. The problem in review can thus be considered a multi-objective optimisation problem with 2 objectives: the score (weighting) of a feature, which was considered in the single-objective scenario, and the memory capacity the mobile phone saves by adding that feature. Solving the multi-objective optimisation problem therefore means determining a subset which maximises both the total sum of the weights and the total amount of memory left in the mobile phone. By the time the experimental work was carried out it had not been possible to obtain a sample dataset of the footprint sizes of the features from the company. In order to carry on with the suggested work, sample footprint sizes ranging between 1 and 9 were assigned to each feature. These sizes were allocated to match the pattern of the data, such as revenue and customer desirability, already allocated to each feature: just as revenue was depicted by the numbers 1, 2 and 3, with 1 representing the lowest level of revenue earned by a feature, the footprint sizes from 1 to 9 represent various levels of memory size, with 1 denoting the smallest amount of memory occupied by a feature.

6.2 Design & Implementation Overview of the Multi-Objective Scenario

The design and implementation of the system which solves the multi-objective optimisation problem was extended from the system used to solve the single-objective aspect of the problem. This section gives an overview of the design and implementation undertaken to extend the system implemented for the previous work discussed in the thesis.

6.2.1 Representation

The problem will be represented as a binary string as was done in the single-objective scenario.

6.2.2 Fitness Function

As mentioned earlier, the problem handles 2 objectives, which are represented by the following fitness functions.

1. revenue × (1 / customer desirability of a feature)

This function, which was also used in the single-objective scenario, represents the score/weighting given to a feature.

2. 1 − footprint size

This value gives an idea of the proportion of memory capacity left in the mobile phone when the corresponding feature is added to it. The objective to be considered in the optimisation is the capacity a feature saves when it is added to the device; since the available data are the footprint sizes assigned to the features, this fitness function was adopted to give an idea of the proportion of memory capacity the corresponding features save.
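The two per-feature objectives can be sketched as below. Note one assumption: for 1 − footprint size to read as a proportion, the footprint is taken here to be normalised into (0, 1] (e.g. size/10 for sizes 1 to 9); the thesis does not spell out this scaling.

```java
// Sketch of the two objective functions described above. The footprint
// normalisation (size/10) is an assumption, not stated in the thesis.
public class MultiObjectiveFitness {
    public static double score(double revenue, double customerDesirability) {
        return revenue * (1.0 / customerDesirability);  // objective 1: feature score/weighting
    }

    public static double memoryLeft(int footprintSize) {
        double normalised = footprintSize / 10.0;       // assumed scaling of sizes 1..9
        return 1.0 - normalised;                        // objective 2: proportion of memory left
    }

    public static void main(String[] args) {
        System.out.println(score(3, 2));        // prints 1.5
        System.out.println(memoryLeft(9) > 0);  // least memory left for the largest footprint
    }
}
```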

6.2.3 Detailed Design and Implementation

As the steps involved in solving the new problem are no different from those of the original system, no modifications were made to its system and database design. However, the components of the current system were extended in order to solve the multi-objective optimisation problem. The modifications consisted of the addition of two new heuristic algorithms which solve multi-objective optimisation problems: the vector evaluated genetic algorithm, extended from the conventional genetic algorithms implemented earlier, and the multi-objective random algorithm, which is new to the system. Both techniques were implemented without changing the structure of the original system; therefore both algorithms can use the data handling components implemented earlier to obtain the data on which they execute.

Vector Evaluated Genetic Algorithm (VEGA)

The vector evaluated genetic algorithm was designed and implemented as an extension of the conventional GAs implemented earlier in the system. As mentioned in Chapter 2, VEGA differs from the conventional GAs only by a modification made to its selection operator. For the purpose of this thesis, it was decided to create a VEGA which applies tournament selection as its selection operator and uniform crossover as its crossover technique. The reason


behind choosing these particular operators over the others was that their combination (GA9) generated the most near-optimal solutions of all the combinations when applied in the single-objective scenario. Nevertheless, for it to be applied in VEGA, it was necessary to modify the process implemented in tournamentSelection() in the class GAManager. The method mTournamentSelection() was therefore implemented in GAManager as a modification of tournamentSelection(): it selects half of the individuals for the mating pool using one fitness function, and the other half using the other fitness function. Although it was decided to design and implement VEGA as suggested by Schaffer, an additional process was also added to the design of VEGA discussed in the thesis, in order to obtain even better results. Once a new generation has evolved, each of its individual solutions is analysed to see whether it can be added to a set of pareto optimal solutions, which is maintained until the last generation is produced. In this way VEGA generates populations through the reproduction of parents selected by a fair selection process which considers both objectives in an unbiased manner, while the additional process considers each individual of the population and optimises both objectives simultaneously, resulting in equivalently optimal solutions. The class GAHandler implements the methods VEGA() and paretoOptimal(), putting the above design into action. The method paretoOptimal() considers each solution in each generated population and compares it with each of the solutions already added to the set of pareto optimal solutions.
If the current solution is dominated by a solution already included in the set (i.e. both objectives of the current solution are less than those of the solution in the set), the current solution is discarded. If the current solution dominates a solution already in the set (i.e. both objectives of the current solution are higher), the dominated solution is removed from the set and the current solution is added. The method VEGA() is a modification of the method geneticAlgorithm() implemented in the class GAHandler. It calls the corresponding selection and crossover methods implemented in GAManager which are necessary to perform Schaffer's VEGA. The termination criterion, as in geneticAlgorithm(), is the completion of 200 iterations (generations), while the number of individuals in a population is always set to 500. Once a new population is generated, the method paretoOptimal() is called in order to maintain the set of equivalently optimal solutions. The method VEGA() finally outputs the ultimate set of pareto optimal solutions generated throughout the process.

Multi-Objective Random Algorithm (MORA)

In order to evaluate the performance of the VEGA implemented, it was necessary to design and implement another algorithm which could solve the problem, so that a comparison could be made. Keeping in mind the time limits available, the multi-objective random algorithm was designed and implemented. The random algorithm is a simple meta-heuristic search algorithm which randomly creates a candidate solution and compares it with the best solution created so far, until the termination criterion, ideally the completion of a certain number of iterations, is met. The multi-objective random algorithm is an extension of the conventional random algorithm and is implemented by the class Random.
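The dominance bookkeeping described for paretoOptimal() above might be sketched as follows. This is an assumed, simplified form operating on raw objective pairs rather than the thesis's solution objects; the strict-dominance rule (both objectives higher) follows the text.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a Pareto archive: a candidate with objective values (o1, o2)
// is added unless some archived solution dominates it, and any archived
// solutions it dominates are removed.
public class ParetoArchive {
    public final List<double[]> archive = new ArrayList<>();

    public static boolean dominates(double[] a, double[] b) {
        return a[0] > b[0] && a[1] > b[1];  // "both objectives higher", as in the text
    }

    public void offer(double[] candidate) {
        for (double[] kept : archive)
            if (dominates(kept, candidate)) return;           // discard the candidate
        archive.removeIf(kept -> dominates(candidate, kept)); // evict dominated members
        archive.add(candidate);
    }

    public static void main(String[] args) {
        ParetoArchive p = new ParetoArchive();
        p.offer(new double[]{1, 1});
        p.offer(new double[]{2, 2});  // dominates (1,1), which is removed
        p.offer(new double[]{3, 1});  // incomparable with (2,2), so both stay
        System.out.println(p.archive.size()); // prints 2
    }
}
```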
The class implements the methods mRandom() and paretoOptimal(), which are responsible for the operations involved. The algorithm was designed in such a way that a solution is created randomly and evaluated in terms of both fitness functions available for the problem, after which it is analysed to see whether it can be added to a set of pareto optimal solutions which is maintained throughout the process. The method mRandom() implements the process which creates and evaluates a random solution at a


time and calls the method paretoOptimal(), which is responsible for the analysis made to ensure that the solution is pareto optimal. The method mRandom() finally outputs the final set of pareto optimal solutions generated throughout the process. The termination criterion for the process implemented in mRandom() was the completion of 100,000 iterations, chosen to match the number of candidate solutions evaluated in VEGA and so instil fairness into the comparison between the two algorithms. The operation explained above was repeated a further 10 times in order to mitigate the effects of the inherent randomness of the algorithm, and the pareto optimal solution set was processed taking all the results generated over the 10 iterations into consideration, in order to produce a reliable outcome.
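The MORA loop just described can be sketched as below. The data and the small iteration count are illustrative (the thesis used 100,000 iterations), and one small deviation from the text is flagged in the comments: the discard test uses weak dominance (>=) so that equal-valued duplicates do not accumulate in the archive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of the multi-objective random algorithm: generate random inclusion
// vectors, evaluate both objectives, and keep a running Pareto set.
public class Mora {
    public static List<double[]> run(long seed, int iterations) {
        Random rng = new Random(seed);
        double[] score = {6, 5, 4};          // hypothetical per-feature scores
        double[] footprint = {0.3, 0.2, 0.1}; // hypothetical normalised footprints
        List<double[]> pareto = new ArrayList<>();
        for (int it = 0; it < iterations; it++) {
            double o1 = 0, used = 0;
            for (int i = 0; i < score.length; i++)
                if (rng.nextBoolean()) { o1 += score[i]; used += footprint[i]; }
            final double[] cand = {o1, 1.0 - used}; // objectives: total score, memory left
            boolean dominated = false;
            for (double[] k : pareto)               // weak dominance: also rejects duplicates
                if (k[0] >= cand[0] && k[1] >= cand[1]) { dominated = true; break; }
            if (dominated) continue;
            pareto.removeIf(k -> cand[0] > k[0] && cand[1] > k[1]); // evict strictly dominated points
            pareto.add(cand);
        }
        return pareto;
    }

    public static void main(String[] args) {
        System.out.println("pareto set size: " + run(7, 1000).size());
    }
}
```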

6.3

Evaluation

Evaluation of the algorithms developed to solve the multi-objective aspect of the optimisation problem was carried out by analysing their performance. The measure of performance compares the pareto fronts generated by the two algorithms in order to determine which algorithm generates the better pareto optimal solutions. The general

Figure 6.1: The graphs, from top to bottom and from left to right, illustrate the pareto fronts generated by VEGA and MORA for the budgets 600, 3060, 3620, 5300 and 6060. The Y and X axes of the graphs represent the fitness scores Rank (revenue × 1/Capacity) and customer desirability of a feature (1 − footprint size) respectively.


experimental methodology undertaken was to carry out the same 35 feature subset selection experiments which were carried out for the analysis of the performance of the techniques developed for the single-objective scenario. The sets of pareto optimal scores generated in each experiment were recorded. Figure 6.1 gives a graphical visualisation of the pareto optimal fronts generated by both algorithms for a selected set of experiments; in detail, it shows the pareto fronts generated by the algorithms for the budgets 600, 3060, 3620, 5300 and 6060. These graphs represent a common pattern noticeable in the majority of the graphs generated after each experiment: the pareto fronts formed from the pareto optimal solutions obtained by MORA always dominated the pareto fronts generated by VEGA. This means that in each and every run, the pareto front generated by MORA contained more solutions that dominate solutions generated by VEGA with respect to all the objectives. Hence, the graphs imply that VEGA was outperformed by MORA in generating pareto optimal solutions throughout the experiments. The results show that, for this particular problem, solutions picked randomly from the search space optimised the multiple objectives simultaneously more successfully than solutions which were carefully selected and subjected to an evolutionary process.
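One simple way to quantify the kind of front-to-front comparison described above is a coverage measure: the fraction of one front's points that are dominated by at least one point of the other front (similar in spirit to Zitzler's C metric). The helper below is a hypothetical illustration, not the evaluation code used in the thesis; fronts are represented as arrays of two-element score vectors.

```java
// Hypothetical helper for comparing two pareto fronts; not the
// thesis's actual evaluation code.
public class FrontComparison {

    // Dominance as used in the text: strictly better on both objectives.
    static boolean dominates(double[] a, double[] b) {
        return a[0] > b[0] && a[1] > b[1];
    }

    // Fraction of points in frontB dominated by some point in frontA.
    // A value of coverage(mora, vega) close to 1 would mean MORA's front
    // dominates most of VEGA's, matching the pattern reported above.
    public static double coverage(double[][] frontA, double[][] frontB) {
        if (frontB.length == 0) {
            return 0.0;
        }
        int dominated = 0;
        for (double[] b : frontB) {
            for (double[] a : frontA) {
                if (dominates(a, b)) {
                    dominated++;
                    break;
                }
            }
        }
        return (double) dominated / frontB.length;
    }
}
```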

6.4

Chapter Summary

The chapter extends the project undertaken so far to solve the multi-objective aspect of the optimisation problem. It acts as a mini project which includes the problem specification, the design and implementation carried out, and the evaluation performed for this extension.

Chapter 7

Summary of Results

7.1

Summary of Findings

This section presents a summary of the results obtained throughout the project, with references to the sections of the thesis in which they are discussed.

• Single-Objective Scenario.

– Performance Analysis.

1. SAHC outperformed NAHC in solving the feature subset selection problem for the single-objective scenario. The scores of the solutions generated by SAHC in the feature subset selection experiments were either greater than or similar to the scores of the solutions generated by NAHC, and the majority were greater. Refer to section 5.1.1 on page 27.

2. GA9 outperformed the rest of the variations of GA, which performed more or less the same. The scores of the solutions generated by GA9 in the feature subset selection experiments were either greater than or similar to the scores of the solutions generated by the other variations of GA implemented, and the majority were greater. The remaining variations of GA displayed values which were quite similar. Refer to section 5.1.1 on page 27.

3. Simulated Annealing performed best, with SAHC, GA9 and the Greedy Algorithm following respectively, while the expert ranking performed worst. Simulated Annealing produced the solutions with the best scores, with SAHC, GA9 and the Greedy Algorithm following; ER produced the solutions with the lowest scores. Refer to section 5.1.1 on page 27.


– Effort Analysis.

1. The variations of GA which used the uniform crossover technique made a lower effort than the rest of the variations of GA. GA3, GA6 and GA9, which used the uniform crossover technique, resulted in effort counters which were lower than those of the other variations. Refer to section 5.1.2 on page 32.

2. The variations of GA which used the uniform crossover technique made an identical effort in generating results, while the rest of the variations showed identical effort values among themselves. If the variations of GA implemented are categorised into two groups, those using uniform crossover and the others, the variations in each category made the same effort. Refer to section 5.1.2 on page 32.

3. All the algorithms implemented performed worse when the efforts they could make in generating solutions were forced down. The techniques implemented showed a reduction in score values when they were run with a termination criterion which lowered the effort they could make. Refer to section 5.1.2 on page 32.

– Robustness Analysis in Terms of Performance.

1. Simulated Annealing appeared to be the most robust technique implemented, with the variations of GA following; SAHC and NAHC were the least robust techniques where performance was concerned. Simulated Annealing showed no variance in the scores of the solutions it generated over 10 runs, while the variations of GA showed small variances. The scores of the solutions produced by SAHC and NAHC over 10 runs varied greatly. Refer to section 5.1.3 on page 35.

– Robustness Analysis in Terms of Speed.

1. All the variations of GA were highly robust in terms of the speeds at which they generated solutions, with SA following; the hill climbers' robustness in terms of their efforts was very low. The variations of GA showed the most consistent effort values in producing solutions over 10 iterations, while Simulated Annealing showed a little variation in the efforts it made. The hill climbers displayed the most unstable effort counts over the 10 runs. Refer to section 5.1.3 on page 35.


• Multi-Objective Scenario.

– Performance Analysis.

1. VEGA was outperformed by MORA in generating the pareto fronts for the multi-objective optimisation problem. The scores included in the pareto front generated by MORA always dominated the scores included in the pareto front generated by VEGA with respect to all objectives. Refer to section 6.3 on page 44.

7.2

Chapter Summary

This chapter has provided a summary of the findings made in Chapters 5 and 6.

Chapter 8

Conclusion

8.1

Achievements

The work presented in the thesis characterises the problem of component selection and prioritisation as a search problem. It shows how the concepts of SBSE can be applied to these problems arising in software release planning. The thesis also presents empirical results which imply that heuristic search techniques produce solutions which exceed those produced by expert judgement alone.

Different heuristic search techniques, namely variations of hill climbing, simulated annealing, genetic algorithms and a greedy algorithm, were designed and implemented and were subjected to a set of feature subset selection experiments in order to determine the best subset of the features available for a series of budgets. The experiments were applied to a set of real world features representing a set of components in a software component base. The results obtained from the experiments were used to evaluate the heuristic techniques involved. The evaluation considered three types of measure: the performance of the algorithms, the efforts they made in achieving that performance, and the robustness of the techniques in terms of the results they generate and the speeds at which they produce those results.

In terms of the performance analysis, the results showed that all the automated techniques developed outperformed the expert ranking, with the simulated annealing approach providing the best results of all. The remaining techniques, in descending order of performance, were the variations of hill climbing, the variations of genetic algorithms and the greedy algorithm. Where the efforts made were concerned, it was shown that the effort made by hill climbing was negligible compared to the efforts made by simulated annealing and the variations of genetic algorithms; the variations of genetic algorithms, however, proved to make a greater effort than simulated annealing. The thesis also reports that the variations of GA which used the uniform crossover technique made an identical effort which was also lower than the efforts made by the variations which used the other crossover techniques; moreover, the variations which used the other crossover techniques also showed effort values which were identical among themselves.

The results also demonstrated a striking robustness of solutions for the simulated annealing approach, followed by the GAs and then the hill climbing algorithms, which displayed higher variances in the solutions generated compared to simulated annealing. On the other hand, the GAs proved to be the most robust technique in terms of the speed at which solutions were generated; the simulated annealing approach appeared to be the second most robust technique in this respect, leaving the variations of hill climbing the least robust techniques in this category.

Although the thesis mainly focussed on the application of search based software engineering


concepts in solving the single-objective aspect of the optimisation problem, it also paid attention to the multi-objective aspect to a certain extent. In doing so, the thesis also involved the design and implementation of the algorithms VEGA and MORA, which were used to solve the multi-objective aspect of the optimisation problem. The evaluations made on VEGA and MORA reported that MORA generated better pareto fronts than VEGA did.

8.2

Future Work

In contrast to the evaluations made on the techniques which solve the single-objective optimisation problem, the evaluations made on VEGA and MORA were not complete. With more time available it would have been possible to evaluate these techniques in terms of the efforts made and the robustness of the results generated and of the speeds at which they were generated. Future work will consider the design and implementation of further multi-objective search algorithms and their evaluation, which could not be included in the thesis due to limitations of time. Furthermore, future work will also include enhancements which look at multi-objective scenarios involving more than two objectives.

8.3

Chapter Summary

This chapter has given a critical analysis of the empirical studies undertaken, highlighting the key achievements of the project, together with an overview of the future enhancements that could be applied to improve the analysis.


Bibliography

[1] Genetic algorithm. http://www.iba.k.u-tokyo.ac.jp/english/GA.htm.

[2] Wikipedia, the free encyclopedia. http://en.wikipedia.org.

[3] Paul Baker, Mark Harman, Kathleen Steinhofel, and Alexandros Skalliotis. Search-based approaches to the component selection and prioritisation problem.

[4] Franco Busetti. Genetic algorithms overview.

[5] Carlos A. Coello Coello. An updated survey of evolutionary multiobjective optimisation techniques: State of the art and future trends.

[6] Alexandre H. F. Dias and Joao A. de Vasconcelos. Multiobjective genetic algorithms applied to solve optimisation problems.

[7] Mark Harman, Robert Hierons, et al. Reformulating software engineering as a search problem.

[8] Mark Harman and Bryan F. Jones. Search based software engineering. Information and Software Technology, 43(14):833–839, December 2001.

[9] Jean-Paul Rodrigue. Pareto optimality tradeoff curve. http://people.hofstra.edu/getrans/eng/chien/appl1en/pareto.html.

[10] Gunther Ruhe and Moshood Omolade Saliu. The science and practice of software release planning.


Appendix A

Source Code

Class DataHandler contains methods which are responsible for obtaining the feature details from the db and calculating the fitness score of each feature.

/*
 * DataHandler.java
 *
 * Created on 19 February 2006, 16:02
 */
package featureoptimisation;

import java.sql.*;
import java.lang.*;
import java.lang.Object;
import javax.swing.*;
import java.util.*;

/**
 * @author Nadisha de Silva
 * The class contains methods which are responsible for
 * obtaining the feature details from the db and
 * calculating the fitness score of each feature
 */
public class DataHandler {

    private DbHandler dbHandler;
    private double fitness[];
    private double revenue[];
    private double customer[];
    private int capacity[];
    private int cost[];
    private int budget[];
    private int components;

    /** Creates a new instance of DataHandler */
    public DataHandler() {
        dbHandler = new DbHandler();
    }

    // obtains each feature's info from the db
    public void FeatureHandler() {
        String Querry1 = ("SELECT COUNT(*) FROM Features ");
        ResultSet rs1 = dbHandler.sqlHandler(Querry1);
        try {
            while (rs1.next()) {
                components = rs1.getInt(1);
            }
        } catch (SQLException sqlException) {
            sqlException.printStackTrace();
        }
        System.out.println("Components: " + components);

        revenue = new double[components];
        customer = new double[components];
        cost = new int[components];
        capacity = new int[components];

        String Querry2 = ("SELECT * FROM Features ");
        ResultSet rs2 = dbHandler.sqlHandler(Querry2);
        int i = 0;
        try {
            while (rs2.next()) {
                double dCustomer = rs2.getDouble(1);
                customer[i] = dCustomer;
                System.out.println("Customer" + i + "\t" + customer[i]);
                int iCost = rs2.getInt(2);
                cost[i] = iCost;
                double dRevenue = rs2.getDouble(3);
                revenue[i] = dRevenue;
                int iCapacity = rs2.getInt(6);
                capacity[i] = iCapacity;
                i += 1;
            } // end while
        } catch (SQLException sqlException) {
            sqlException.printStackTrace();
        }
    }

    // call this once a table is created for budgets
    public void budgetHandler() {
        budget = new int[components];
        String Querry3 = ("SELECT * FROM Budgets ");
        ResultSet rs3 = dbHandler.sqlHandler(Querry3);
        int i = 0;
        try {
            while (rs3.next()) {
                int myBudget = rs3.getInt(2);
                budget[i] = myBudget;
                i = i + 1;
            } // end while
        } catch (SQLException sqlException) {
            sqlException.printStackTrace();
        }
    }

    /* calculate the fitness of each feature */
    public void CalculateFitness() {
        fitness = new double[components];
        for (int counter = 0; counter < components; counter++) {
            // ... (loop body truncated in the source)
        }
    }
}
