Proceedings of the International Conference on Robotics, Vision, Information and Signal Processing ROVISP2007

Variable Length Genetic Algorithm for Feature Selection in High Dimensional Domains

Anwar Ali Yahya (1), Ramlan Mahmod (2), Fatimah Dato Ahmad (3)

(1) Faculty of Computer Science and Information Technology, Universiti Putra Malaysia, 43400 UPM Serdang, Selangor, Malaysia. Tel: +60-3-89422358, E-mail: [email protected]
(2) Faculty of Computer Science and Information Technology, Universiti Putra Malaysia, 43400 UPM Serdang, Selangor, Malaysia. Tel: +60-3-89466556, E-mail: [email protected]
(3) Faculty of Computer Science and Information Technology, Universiti Putra Malaysia, 43400 UPM Serdang, Selangor, Malaysia. Tel: +60-3-89466595, E-mail: [email protected]

Abstract

Feature selection is a task of crucial importance for machine learning applications. High dimensional domains are characterised by a huge number of features, which makes feature selection much more difficult. For these domains, a specific type of feature selection approach, called a ranking approach, is usually applied. Despite the computational efficiency and success of these approaches, they still have some drawbacks. First, the selection is independent of the inductive and representational biases of the machine learning algorithm. Second, they are unable to discover correlations between the candidate features. Third, they do not exploit negative features in an optimal way. This paper presents a new approach for feature selection in high dimensional domains that handles these drawbacks. It is essentially based on a variable length version of the genetic algorithm developed specifically for this purpose. The paper presents an example of the application of the proposed approach to the selection of lexical cues in the context of building a dialogue act recognition model. The performance of the proposed approach is compared with several ranking approaches in terms of predictivity and aggressivity. The results provide experimental evidence of the ability of the proposed approach to handle the drawbacks of the ranking approaches.

Keywords: Feature Selection, Genetic Algorithm, Ranking Approaches, Dialogue Act Recognition

Introduction

Feature selection is a task of crucial importance for Machine Learning (ML) applications, as it affects the accuracy of the learned function, the learning time, and the number of training examples required. The ML literature reveals a vast number of techniques developed for feature selection. Despite the diversity of these techniques, they can be grouped into two main categories: filter approaches and wrapper approaches [1]. For many domains that require feature selection, the number of features can be huge, and consequently the problem of feature selection becomes much more difficult. These domains are commonly known as high dimensional domains, of which dialogue act recognition and text categorisation are two popular examples. For these domains, a specific type of filter approach called a ranking approach is usually applied [6, 7]. This is a direct consequence of the computational efficiency of the ranking approaches and the inability of the wrapper approaches to scale well to high dimensional domains. The general procedure of a ranking approach is to score each potential feature using a particular metric and then pick the top k features. The metric can be either one-sided or two-sided [9]. A one-sided metric considers only the features most indicative of class membership, called positive features; it never considers negative features, which come from the non-relevant class, unless all the positive features have already been selected. A two-sided metric, on the other hand, does not differentiate between positive and negative features; it implicitly combines the two. Examples of ranking metrics are given in Table 1.

Table 1 – Ranking Approaches Metrics

Metric                   | Type      | Formula
Mutual Information (MI)  | One-sided | MI(f, c) = log [ P(f, c) / (P(f) P(c)) ]
Odds Ratio (OR)          | One-sided | OR(f, c) = [ P(f|c) (1 − P(f|c̄)) ] / [ (1 − P(f|c)) P(f|c̄) ]
Information Gain (IG)    | Two-sided | IG(f, c) = Σ_{c′∈{c, c̄}} Σ_{f′∈{f, f̄}} P(f′, c′) log [ P(f′, c′) / (P(f′) P(c′)) ]
Chi Square (χ²)          | Two-sided | χ²(f, c) = N [ P(f, c) P(f̄, c̄) − P(f, c̄) P(f̄, c) ]² / [ P(f) P(f̄) P(c) P(c̄) ]


Despite the computational efficiency and proven success of the ranking approaches, they still have a number of drawbacks. First, in the ranking approaches the selection is independent of the inductive and representational biases of the ML algorithm [1]. In other words, the selection of features is based on the intrinsic properties of the data rather than on the way in which these features are used by the ML algorithm. The potential impact is the inclusion of some irrelevant features and the exclusion of relevant ones. Second, the ranking approaches are unable to discover correlations between the candidate features, which results in a redundant selection [2, 6]: owing to strong correlation between the features, the informativeness of the selected features is much less than the sum of the information brought by each feature. Third, on imbalanced datasets, the ranking approaches either ignore the role of negative features or combine them in a non-optimal way [10]. A one-sided metric does not consider negative features, because it ranks them after the positive features. A two-sided metric implicitly treats positive and negative features equally, despite the fact that the values of positive features are not comparable with those of negative features.

The remainder of this paper is organised as follows. First, the proposed VLGA-based approach for feature selection is introduced. Then, an example of its application to the selection of lexical cues in the context of building a dialogue act recognition model is described. After that, results and discussion are presented, followed by a conclusion.

VLGA-based Approach

From the above discussion, it is obvious that the drawbacks of the ranking approaches make them suboptimal for feature selection. This is a direct consequence of the selection scheme adopted by these approaches, which is merely a simple local search. From a general perspective, the optimal feature selection approach should be based on the Markov blanket in a faithful Bayesian network [8], which calls for an exhaustive search of the feature space; due to its inefficiency for high dimensional datasets, such a search is seldom used. Heuristic search approaches therefore become a compelling intermediate alternative between these two extremes. GA is a striking example of heuristic search that has reported good empirical results for feature selection in several domains [4, 9]. In those works, a GA with fixed length binary representation is applied: each chromosome in the GA population is a binary string whose length equals the maximum number of features. A bit value of 1 in the chromosome means that the corresponding feature is included in the specified subset, and a value of 0 means that it is not, as shown in Figure 1.

Figure 1 – Fixed Length GA Chromosome

The advantage of this representation is that the conventional GA can be used without any modification. Unfortunately, the fixed length binary representation is appropriate only for domains with a small number of features. If the maximum number of features is too large, the computational resources required during GA evolution will be very high, particularly with wrapper approaches; the situation becomes worse when only a small number of these features are relevant. An alternative choice is a non-binary representation that encodes each chromosome as the list of selected features, rather than as the presence or absence of every feature. However, forcing the length of the chromosome to be fixed imposes a limit on the maximum number of relevant features, which is unknown a priori. In the following sections, the elements of the developed VLGA are presented; for a detailed description of the conventional GA, refer to [5].

Representation Scheme

The proposed VLGA-based approach is based on a variable length, non-binary representation scheme in which each chromosome represents the selected subset of features. It is a direct representation scheme, with no encoding or decoding process to map between the genotype and phenotype levels. Figure 2 shows an example of a chromosome represented using this scheme. An interesting aspect of this representation is that it is positionally independent: the position of a gene has no role in determining the properties of the chromosome at the phenotype level.

Figure 2 – Variable Length Chromosome

Features Space Mask

Technically, GA search is a process that explores promising points in the search space through genetic operations. Hence, the representation scheme and the genetic operators should provide a mechanism that ensures good exploration of the search space. Using the proposed representation scheme directly does not help the genetic operators explore new points in the search space: if the current population of VLGA contains n chromosomes made up of m features, every generated chromosome will, by definition, consist of a combination of those m features. Therefore, to ensure a good re-sampling of the space, a features space mask is introduced. The space mask is a binary string whose length equals the size of the feature space; it marks the status of each feature. The value 1 indicates that the feature is in use in the current population, and the value 0 indicates that it is not. Figure 3 shows part of the features space mask for the DAR domain.

Figure 3 – Features Space Mask
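As a concrete illustration of the representation and the mask, here is a minimal sketch; identifying features by integer indices into the feature space is an assumption made for the example.

```python
import random

def random_chromosome(feature_space_size, length, mask):
    """A chromosome is just the list of selected feature indices
    (direct, position-independent representation); every selected
    feature is flagged active (1) in the features space mask."""
    chromosome = random.sample(range(feature_space_size), length)
    for f in chromosome:
        mask[f] = 1
    return chromosome

mask = [0] * 10_000                  # one status flag per feature in the space
chrom = random_chromosome(10_000, 8, mask)
```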


Fitness Function

The design of the fitness function should take into account the drawbacks of the ranking approaches. First, in order to avoid selecting irrelevant features, the fitness function should depend on the predictivity of the selected features, measuring the informativeness of the selected subset according to the way in which it will be used in the subsequent stage. To avoid redundant selection, the fitness function should account for the correlations between the selected features by evaluating the informativeness of the selected features as a whole, rather than evaluating each feature individually and assuming that the predictivity of the selection equals the sum of the individual predictivities. There is also a possibility that two chromosomes of different lengths have the same predictivity value. In order to guide the selection towards the minimal number of phrases, the fitness function utilises the aggressivity measure introduced by [3] as a penalty on the length of the chromosome. Accordingly, the fitness function combines predictivity and aggressivity as follows:

Fitness = predictivity − f / F    (1)

where f is the length of the chromosome (the number of selected features) and F is the size of the feature space.

Finally, with regard to the optimal utilisation of negative features in imbalanced datasets, the idea is to use the negative features to help the positive features confidently reject instances that do not belong to the target class. This can be realised by associating with each positive feature a number of negative features to increase its informativeness.
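A minimal sketch of Equation (1), assuming a caller-supplied predictivity_of function that evaluates the selected subset as a whole (for example, Equation (2) below):

```python
def vlga_fitness(chromosome, feature_space_size, predictivity_of):
    """Equation (1): fitness = predictivity - f/F, where f is the
    chromosome length and F the size of the feature space, so longer
    chromosomes pay a proportional aggressivity penalty."""
    return predictivity_of(chromosome) - len(chromosome) / feature_space_size
```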

VLGA Selection Scheme

VLGA makes use of the q-(k, r) tournament selection scheme. This scheme randomly chooses q chromosomes from the current generation and, with probability k, returns the best chromosome; otherwise (with probability r) it returns the worst chromosome.
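A sketch of the q-(k, r) tournament under the reading above (best with probability k, worst otherwise):

```python
import random

def tournament_select(population, fitness, q=10, k=0.7):
    """q-(k, r) tournament selection: sample q chromosomes and return
    the fittest with probability k, otherwise the least fit."""
    contestants = random.sample(population, q)
    contestants.sort(key=fitness, reverse=True)
    return contestants[0] if random.random() < k else contestants[-1]
```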

VLGA Genetic Operators

VLGA makes use of three genetic operators from the conventional GA, with some modifications to suit the proposed representation scheme. Moreover, VLGA introduces a new operator called AlterLength.

Reproduction

The reproduction operator of the proposed VLGA is similar to that of the conventional GA. With reproduction probability Pr, a chromosome is randomly selected from the current generation and copied into the new generation without any modification.

Crossover

For VLGA, the uniform crossover previously proposed for fixed length GA has been adapted. A uniform crossover [5] is an operator that decides, with probability Pc, which parent will contribute each of the gene values in the offspring chromosome. This operator has been adapted to cope with the variable length representation. Two chromosomes are selected randomly. If the length of the shorter one is chosen to be the length of the offspring, a uniform crossover is performed between the shorter chromosome and a segment of the same length from the longer chromosome. If the length of the longer chromosome is chosen, a uniform crossover is likewise performed between the shorter chromosome and an equal length segment of the longer chromosome, and the remaining parts of the longer chromosome are appended to the beginning and the end of the offspring. Figure 4 shows schematically how this is performed.

Figure 4 – VLGA Uniform Crossover Example
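A sketch of the adapted crossover, assuming gene order does not matter (the representation is position independent) and that the offspring length is chosen between the two parent lengths with equal chance:

```python
import random

def vlga_crossover(parent_a, parent_b):
    """Variable-length uniform crossover: mix the shorter parent with an
    equal-length segment of the longer one; if the offspring takes the
    longer parent's length, the leftover ends are appended back."""
    short, long_ = sorted((parent_a, parent_b), key=len)
    start = random.randrange(len(long_) - len(short) + 1)
    segment = long_[start:start + len(short)]
    # Uniform mixing: each gene comes from either parent with equal chance.
    core = [random.choice(pair) for pair in zip(short, segment)]
    if random.random() < 0.5:           # offspring keeps the shorter length
        return core
    # Offspring keeps the longer length: re-attach the unused ends.
    return long_[:start] + core + long_[start + len(short):]
```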

Mutation

Since the proposed version of VLGA depends on a non-binary representation, the proposed mutation operator replaces the values of some genes with new values from the feature space that are not participating in the current population. This operator is performed with the assistance of the features space mask: a gene value in the selected chromosome is replaced by a value from the feature space whose status in the mask is inactive. Figure 5 shows the mutation operator schematically.

Figure 5 – VLGA Mutation Example
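A sketch of the mutation operator; the per-gene mutation probability p_m and the simplified mask maintenance (abandoned features are assumed to be deactivated in a separate per-generation pass) are assumptions made for the example:

```python
import random

def vlga_mutate(chromosome, mask, p_m=0.01):
    """With probability p_m per gene, replace it by a feature whose mask
    status is 0 (not in use in the current population)."""
    inactive = [f for f, used in enumerate(mask) if used == 0]
    random.shuffle(inactive)
    mutated = []
    for gene in chromosome:
        if inactive and random.random() < p_m:
            gene = inactive.pop()       # draw an unused feature
            mask[gene] = 1              # it is now active in the population
        mutated.append(gene)
    return mutated
```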

AlterLength

The crossover and mutation operators are designed to introduce variation in the content of the chromosome. To introduce variation in the length of the chromosome, the AlterLength operator is proposed. It randomly shrinks or expands the chromosome by removing a single phrase from, or adding a single phrase to, the chromosome. In the case of addition, the added phrase is



randomly selected from the inactive phrases in the phrase space. The AlterLength operator is performed with probability Pal. Figure 6 shows an example of the AlterLength operator.

Figure 6 – VLGA AlterLength Example
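A sketch of AlterLength; the 50/50 choice between shrinking and expanding is an assumption, as the paper only says the operator "randomly shrinks or expands" the chromosome:

```python
import random

def alter_length(chromosome, mask, p_al=0.2):
    """With probability p_al, shrink or expand the chromosome by one
    phrase; an added phrase is drawn from the inactive part of the
    phrase-space mask."""
    if random.random() >= p_al:
        return chromosome
    inactive = [f for f, used in enumerate(mask) if used == 0]
    if inactive and (len(chromosome) <= 1 or random.random() < 0.5):
        new_phrase = random.choice(inactive)            # expand by one phrase
        mask[new_phrase] = 1
        return chromosome + [new_phrase]
    return chromosome[:-1] if len(chromosome) > 1 else chromosome  # shrink
```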

VLGA for Lexical Cues Selection

To evaluate the proposed approach, it has been applied to feature selection in the context of designing a dialogue act recognition (DAR) model. An essential issue that arises while building the DBNs model for dialogue act recognition is the specification of the DBNs random variables. For this model, it has been proposed that the number of random variables be equal to the number of DAs. Moreover, each variable is defined as a logical rule in disjunctive normal form consisting of a set of lexical phrases that are informative for a certain DA. The set of phrases constituting each random variable is selected automatically from the set of lexical phrases. The predictivity of a set of cues is defined as follows:

predictivity = (TP + FN) / N    (2)

where TP is the number of times the selected cues give true when the instance belongs to the target DA, FN is the number of times the selected cues give false when the instance does not belong to the target DA, and N is the total number of instances. For this application, the dialogue corpus is a collection of 64 dialogues in the domain of information exchange and transaction in a theatre. To prepare the phrase space, a sequence of processes is proposed to convert the original dialogue corpus into a new space suitable for VLGA. The proposed processes involve tokenisation, removal of morphological variation, semantic clustering, N-gram phrase generation, and removal of less frequent phrases.
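A sketch of Equation (2), using the paper's definitions of TP and FN; cue_rule is a hypothetical predicate standing in for the DNF rule built from the selected cues:

```python
def predictivity(cue_rule, instances, target_da):
    """Equation (2): (TP + FN) / N, where TP counts target-DA instances
    on which the cue rule fires and FN counts non-target instances on
    which it does not. `instances` is a list of (utterance, da) pairs."""
    tp = sum(1 for utt, da in instances if da == target_da and cue_rule(utt))
    fn = sum(1 for utt, da in instances
             if da != target_da and not cue_rule(utt))
    return (tp + fn) / len(instances)
```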

Results and Discussion

Ranking Approaches

The goal of these experiments is to evaluate the performance of the ranking approaches described above on the selection of lexical cues. Each of the aforementioned ranking approaches is applied to rank the phrases: for each DA, each ranking approach ranks the phrases according to its own metric. To evaluate each ranking approach for the selection of informative lexical cues for each DA, the predictivity value of the top k phrases in each list is calculated (k = 1, 2, ..., n), and the maximum predictivity value and its associated aggressivity value are recorded. The results are given in Table 2.

In terms of predictivity, the one-sided metrics outperform the two-sided metrics, even when they are of the same family. A striking example is IG, a two-sided metric extended from MI (a one-sided metric): the predictivity values obtained using MI are much better than those obtained using IG. The performance of MI and OR is comparable. In terms of aggressivity, although MI and OR achieve high predictivity values, they need a notably large number of phrases (low aggressivity). Both IG and χ² have the highest aggressivity values but the lowest predictivity values.



Table 2 – Results of Ranking Approaches Lexical Cues Selection (Pred. = predictivity, Aggr. = aggressivity)

DA              | MI Pred. | MI Aggr. | χ² Pred. | χ² Aggr. | IG Pred. | IG Aggr. | OR Pred. | OR Aggr.
statement       | 0.8652   | 0.6751   | 0.6385   | 0.9948   | 0.6619   | 0.9993   | 0.8578   | 0.6602
query-if        | 0.9658   | 0.9820   | 0.9580   | 0.9993   | 0.9580   | 0.9993   | 0.9653   | 0.9873
query-ref       | 0.9047   | 0.9042   | 0.8149   | 0.9940   | 0.8149   | 0.9940   | 0.9018   | 0.9349
positive-answer | 0.8642   | 0.7927   | 0.7860   | 0.9993   | 0.7860   | 0.9993   | 0.8666   | 0.8136
negative-answer | 0.9702   | 0.9933   | 0.9595   | 0.9985   | 0.9595   | 0.9985   | 0.9702   | 0.9933
no-blf          | 0.8070   | 0.8398   | 0.7333   | 0.9985   | 0.7333   | 0.9985   | 0.8051   | 0.8413

VLGA for Positive Lexical Cues Selection

The goal of these experiments is to assess the performance of the proposed VLGA-based approach on the selection of positive lexical cues. The proposed VLGA-based approach is applied to select the positive lexical cues for the aforementioned DAs. The parameters of VLGA are: population size = 500, q = 10, k = 0.7, r = 0.3, Pc = 0.7, Pr = 0.1, Pal = 0.2, Pm = 0.01, and the stopping criterion is to stop if there is no significant change within 10 generations. The results are shown in Table 3, and Figure 7 traces a typical run.




Table 3 – Results of VLGA Lexical Cues Selection (positive cues)

DA              | Pred.      | Aggr.
statement       | 0.871031   | 0.967814
query-if        | 0.964338   | 0.996257
query-ref       | 0.903273   | 0.994012
positive-answer | 0.852467   | 0.980539
negative-answer | 0.96922326 | 0.998503
no-blf          | 0.81143135 | 0.983533


Figure 7 – VLGA Evolution for Positive Cues Selection (predictivity, aggressivity, and fitness for the statement DA, plotted against generation number)


VLGA for Positive and Negative Lexical Cues Selection

The goal is to assess the ability of the proposed VLGA-based approach to exploit negative lexical cues. Each phrase that occurs within utterances belonging to the target DA is marked positive, and each phrase that occurs within utterances not belonging to the target DA is marked negative. Some phrases occur both in utterances labelled with the target DA and in utterances not labelled with it, so a phrase may be marked both positive and negative. The results are shown in Table 4, and Figure 8 traces a typical run.
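A minimal sketch of this marking step, assuming each utterance is represented as a (phrase_set, da_label) pair; a phrase seen on both sides of the boundary ends up in both sets:

```python
def mark_phrases(utterances, target_da):
    """Label every phrase positive and/or negative with respect to a
    target DA, as described above."""
    positive, negative = set(), set()
    for phrases, da in utterances:
        (positive if da == target_da else negative).update(phrases)
    return positive, negative
```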


Table 4 – Results of VLGA Lexical Cues Selection (positive and negative cues)

DA              | Pred.    | Aggr.
statement       | 0.889595 | 0.964
query-if        | 0.968246 | 0.991217
query-ref       | 0.91744  | 0.968074
positive-answer | 0.869565 | 0.954341
negative-answer | 0.967269 | 0.999251
no-blf          | 0.832437 | 0.961826


Figure 8 – VLGA Evolution for Pos(Neg) Cues Selection (predictivity, aggressivity, and fitness for the statement DA, plotted against generation number)


Conclusion

The first conclusion that can be drawn is that the ranking approaches are not optimal for lexical cues selection in high dimensional domains where the features are highly correlated. The inability of these approaches to account for the correlation between the selected features leads to the selection of a huge number of lexical cues, which in turn affects the subsequent stages of the ML application that depend on the selected features. Moreover, the results confirm the inability of the ranking approaches to exploit negative features in imbalanced data domains.

On the other hand, the ability of the proposed VLGA-based approach to account for the correlation between the selected cues enables it to select the minimal number of cues that gives the maximum informativeness. This is evident from the large reduction in the number of selected cues while achieving predictivity values comparable to those obtained by the ranking approaches. In contrast to the ranking approaches, the proposed VLGA shows its ability to exploit the negative cues for better informativeness of the selected cues: the VLGA-based approach associates with each positive cue a number of negative cues that help it reject data instances that contain the positive cue but do not belong to the target DA type.

References

[1] Blum, A., and Langley, P. 1997. Selection of Relevant Features and Examples in Machine Learning. Artificial Intelligence, 97:245–271.
[2] Fleuret, F. 2004. Fast Binary Feature Selection with Conditional Mutual Information. Journal of Machine Learning Research, 5:1531–1555.
[3] Galavotti, L., Sebastiani, F., and Simi, M. 2000. Experiments on the Use of Feature Selection and Negative Evidence in Automated Text Categorization. In Proceedings of ECDL-00, Lisbon, Portugal, 59–68.
[4] Liu, J., Iba, H., and Ishizuka, M. 2001. Selecting Informative Genes with Parallel Genetic Algorithms in Tissue Classification. Genome Informatics, 12:14–23.
[5] Mitchell, M. 1996. An Introduction to Genetic Algorithms. Cambridge: MIT Press.
[6] Samuel, K., Carberry, S., and Vijay-Shanker, K. 1999. Automatically Selecting Useful Phrases for Dialogue Act Tagging. In Proceedings of the 14th Conference of the Pacific Association for Computational Linguistics.
[7] Sebastiani, F. 2002. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1):1–47.
[8] Tsamardinos, I., Aliferis, C., and Statnikov, A. 2003. Algorithms for Large Scale Markov Blanket Discovery. In Proceedings of the 16th International FLAIRS Conference.
[9] Yang, J., and Honavar, V. 1998. Feature Subset Selection Using a Genetic Algorithm. IEEE Intelligent Systems, 13(2):44–49.
[10] Zheng, Z., Wu, X., and Srihari, R. 2004. Feature Selection for Text Categorization on Imbalanced Data. SIGKDD Explorations, 6(1):80–89.
