Searching for Features using a Genetic Algorithm

Christer Johansson
Bergen University, Norway
{christer.johansson}@lili.uib.no

Abstract. Automatic classification of words uses abstract representations of lexical items. Such representations are usually not easily derived from the available data (strings of letters), which is a core problem for nearest neighbor methods. This article describes research towards a genetic algorithm that invents features relevant for automatic word classification. The GA optimizes a representation created by a random process. Factors in genetic algorithms are identified for the task, and a novel solution is presented for creating features that can be used with data sets of millions of words, with minimal demands on feedback. The result is an unsupervised categorization of a large corpus.

1 Introduction

A Genetic Algorithm (GA) is a method for searching for good solutions in a large population of candidate solutions. Methods based on gradient descent or hill climbing (e.g., backpropagation neural networks) gradually home in on a solution that minimizes errors. In contrast, GAs search the current candidate space extensively to find the best current approximations to the unknown solution. The selected candidates are combined in various ways in order to improve future results. GAs could be said to implement an elitist approach to searching the much vaster space of possible solutions: only the best candidates are considered worth spending further effort on. Combining the best solutions in a disciplined way helps increase the chances that good partial solutions accumulate, while unhelpful code is eliminated. GAs are less likely than hill climbing algorithms to get trapped in locally optimal solutions, for two reasons: they search through more possible starting points, and they are not limited to a continuous minimization of errors.

The biologically inspired terminology of the genetic algorithm is used as a convenient metaphor for organizing its building blocks. A representation of a solution is called a chromosome. Each chromosome is a collection of variables called genes, which are stored at fixed positions in the chromosome. Mutations (randomly changed gene values) introduce new variation into the population. A fitness value is calculated to guide the algorithm towards the best solution. Combining two chromosomes is referred to as cross-over. A set of several chromosomes may be referred to as an organism, and a population of organisms is the basis for selection: the candidate space. We are interested in selecting the best matching organism for each new organism that is about to enter the population.
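To make this terminology concrete, the sketch below shows one way the building blocks could be laid out in C. The sizes (50 genes per chromosome, five chromosomes per organism) anticipate values given later in this paper, and the uniform cross-over scheme is an illustrative assumption, not the exact procedure used here.

    #include <stdlib.h>

    #define GENES 50   /* genes per chromosome (see section 2) */
    #define WORDS  5   /* chromosomes per organism (the instance size) */

    typedef struct { double gene[GENES]; } Chromosome;
    typedef struct { Chromosome chrom[WORDS]; } Organism;

    /* Uniform cross-over: each gene of the child is copied at random
       from one of the two parents. Illustrative; other schemes exist. */
    void crossover(const Chromosome *a, const Chromosome *b, Chromosome *child) {
        for (int i = 0; i < GENES; i++)
            child->gene[i] = (rand() & 1) ? a->gene[i] : b->gene[i];
    }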


How can the algorithm search if it does not know what to search for? We need to create a representation for lexical forms. We do this by providing each new word form with a chromosome consisting of initially random gene values. An instance of five words likewise has an associated organism consisting of five chromosomes. (An instance size of five words is used mostly for convenience.) A look-up structure called a lexicon maps each word form to its chromosome. This look-up is implemented efficiently by numbering the words and using the index numbers to find the chromosomes.

The task we are interested in is to find representations that help a nearest neighbor (NN) mechanism [13] find better matches to new input from a population of previous instances. The core engine of the mechanism is a memory based k-NN (k nearest neighbours) model with two levels of similarity: the feature level and the word level, corresponding to genotype and phenotype in biological terminology. The process is reminiscent of genetic adaptation: the new input takes on features of the closest match.

A trivial solution to finding close matches quickly is to make all representations identical, but in that case we have gained nothing by having a representation. Context is introduced to restrict the possible choices of matches. Words that occur consistently in the same or similar contexts will have their representations changed towards a better match. The risk that all chromosomes become identical is determined mostly by the fitness function. Previous work with neural networks [12] used a simple recursive network to change lexical representations in a lexicon. The obvious solution, which would have minimized global errors, was in effect avoided, and useful representations emerged for a rather small toy task. In contrast, we are planning to process millions of words using our mechanism.

A search for a nearest neighbor delivers one of the best matches to each new input from the current population of alternatives. As is usual in GAs, the best matching organism is allowed to combine its chromosomes with those of the new input. The GA harmonizes the representations so that the same match on the word level will be even closer on the gene level the next time. The GA thus reduces the variability of representations. Gene survival in nature has a similar tendency to reduce variability at the expense of low frequency expressions, such as mutations.

In this article we assume an initial population with a high rate of mutations (i.e., all words initially have random representations). The environment is a corpus where the words are seen in the context of other words. Random representations, given enough genes (features), have one attractive property for use in an NN mechanism: they are highly selective and very likely to be initially unique for each lexical item. A similar approach is described by Sahlgren [15], and a related model was previously researched by Johansson [6].
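As a concrete illustration of the lexicon described above, the sketch below maps word index numbers to chromosomes and initializes new words with random gene values. The initial range [-1, +1] and the allocation scheme are assumptions for the example, not values taken from the implementation reported here.

    #include <stdlib.h>

    #define GENES 50

    typedef struct { double gene[GENES]; } Chromosome;

    /* The lexicon: word #i owns chromosome lexicon[i]. */
    static Chromosome *lexicon;

    void init_lexicon(int vocab_size) {
        lexicon = malloc(vocab_size * sizeof *lexicon);
    }

    /* Give a newly seen word an initially random, highly selective
       representation; each gene drawn uniformly from [-1, +1]. */
    void init_word(int word_id) {
        for (int i = 0; i < GENES; i++)
            lexicon[word_id].gene[i] = 2.0 * rand() / RAND_MAX - 1.0;
    }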
The general idea of feature creation is related to a well known technique, used in most serious spell checking software, referred to as Bloom filters or existential dictionaries [1, 5]. The technique efficiently represents that a word exists without representing the word itself. This somewhat amazing result is achieved using a number of hash codes. For example, 10 codes, each depending on the entire word, are calculated for each word. Each code points to a specific bit to set in a very large vector initially filled with zeroes. This vector is called the existential dictionary. When half of the bits in the dictionary are on, there is a (1/2)^10 risk that a novel word has all its ten bits turned on in the dictionary. This is the risk of a false positive: saying that an unseen word was in the dictionary. If we want to lower that risk, we can simply set more bits per word and use a larger dictionary. We will put this idea of existential dictionaries to work for generalization.
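A minimal sketch of such an existential dictionary in C follows, assuming a 2^24-bit dictionary and a simple salted FNV-1a style hash; a real spell checker would use larger sizes and better-distributed hash functions.

    #include <stdint.h>

    #define DICT_BITS (1u << 24)  /* illustrative dictionary size */
    #define NUM_CODES 10          /* hash codes per word, as in the example */

    static uint8_t dict[DICT_BITS / 8];  /* the existential dictionary */

    /* FNV-1a style hash of the entire word, salted per code; illustrative. */
    static uint32_t hash(const char *word, uint32_t salt) {
        uint32_t h = 2166136261u ^ salt;
        for (const char *p = word; *p; p++) { h ^= (uint8_t)*p; h *= 16777619u; }
        return h % DICT_BITS;
    }

    void add_word(const char *word) {
        for (uint32_t k = 0; k < NUM_CODES; k++) {
            uint32_t b = hash(word, k);
            dict[b / 8] |= (uint8_t)(1u << (b % 8));
        }
    }

    /* Returns 1 if the word may be in the dictionary (false positive risk
       about (1/2)^10 when half the bits are set); 0 if it certainly is not. */
    int maybe_in_dict(const char *word) {
        for (uint32_t k = 0; k < NUM_CODES; k++) {
            uint32_t b = hash(word, k);
            if (!(dict[b / 8] & (1u << (b % 8)))) return 0;
        }
        return 1;
    }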


2 The Model: definitions

A textbook introduction to GAs is given by Mitchell [13]. The version of the GA presented in this article is a simple standard GA, except for some features that are due to the task of processing text. We have introduced a lexicon, where a phenotype (i.e., a word form) can be looked up to find its chromosome (i.e., the representation that the GA uses). Mutations are not used yet; instead there is a very high degree of initial variation in the data set, which in some sense serves the purpose of random mutations. A last difference is that the fitness function relies on the best match that an instance based nearest neighbor model can deliver. More information on nearest neighbor models is given by Mitchell [13]. Johansson [6, 7] gives more information on the particular instance based model used in this article.

A chromosome is a vector of N floating point numbers (a lexical representation) in the range [−11, +11]. The numbers represent the strength of the gene; large absolute values mean that many examples of this gene have been encountered. The chromosomes used in this paper contain 50 genes.

A gene is a position in a chromosome. For lexical representations we may alternatively call the genes features. For word #3, feature #2 might be 0.26, while for word #4 the same feature could have the value −1.57.

A word is a string of characters found in a collection of texts. Each word is coded as an index number (e.g., cat might be #3, and dog might be #13872). The number indexes the lexicon and assigns a single lexical representation (chromosome) to each word.

Pseudo-code. I will present some pseudo-code for the main procedures of the algorithm, so that the reader can easily program a similar model. The actual program was written in C. I would like to concentrate on the architecture of the mechanism, and leave out some mathematical precision. The model is expected to change and improve as we find solutions to some of the problematic issues of feature creation (see sections 4 and 5). The architecture is especially important in relation to the structures that tie features, words, and instances together; the most important point is to understand how the lexicon works in relation to words and chromosomes.

A genetic feature match is a fuzzy identity operation on specified values, x and y, of a specific gene (e.g., #3) in two specified words (e.g., words #3 and #4). It answers the question: to what degree is the feature the same in the two words? The feature is judged to be the same to a degree calculated as z = xy. The same sign is a match; different signs indicate a mismatch. The output is restricted to the range [−1, +1]. As will be discussed later, the expected value of a feature is 0, and a perfect match (or mismatch) requires an extreme product (greater than 1, or less than −1). This means that in order to make all chromosomes have an optimal match with each other, all features have to be pushed towards extreme values of the same sign.

    double fuzzy_id(double x, double y) {
        double z = x * y;
        if (z > 1) return 1;
        else if (z < -1) return -1;
        else return z;
    }
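Given fuzzy_id, a natural word level similarity is an aggregate of the gene level matches. The aggregation has not been specified at this point in the paper, so the mean used below is an assumption for illustration.

    #define GENES 50

    double fuzzy_id(double x, double y);  /* as defined above */

    /* Word-level similarity: mean of the gene-level fuzzy matches,
       in [-1, +1]. The aggregation (a mean) is assumed, not taken
       from the paper. */
    double word_match(const double *a, const double *b) {
        double sum = 0.0;
        for (int i = 0; i < GENES; i++)
            sum += fuzzy_id(a[i], b[i]);
        return sum / GENES;
    }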
