Evolving Text Classifiers with Genetic Programming

Laurence Hirsch¹, Masoud Saeedi¹, and Robin Hirsch²

¹ School of Management, Royal Holloway University of London, Surrey, TW20 0EX, UK
² University College London, Gower Street, London, WC1E 6BT, UK

Abstract. We describe a method for using Genetic Programming (GP) to evolve document classifiers. GPs create regular-expression-like specifications consisting of particular sequences and patterns of N-Grams (character strings), and acquire fitness by producing expressions that match documents in a particular category but do not match documents in any other category. Libraries of N-Gram patterns have been evolved against sets of pre-categorised training documents and are used to discriminate between new texts. We describe a basic set of functions and terminals and provide results from a categorisation task using the 20 Newsgroup data.

Keywords: Text categorisation, N-Gram, Genetic Programming

1 Introduction

Automatic text categorization is the activity of assigning pre-defined category labels to natural language texts based on information found in a training set of labelled documents. Text categorization can be viewed as a special case of the more general problem of identifying a category in a space of high dimensions so as to define a given set of points in that space, and quite sophisticated geometric systems for categorization have been devised [1]. In recent years text categorization has been recognized as an increasingly important tool for handling the exponential growth in available online text, and we have seen the development of many techniques aimed at extracting features from a set of training documents which may then be used for categorisation purposes. Classifiers built on the frequency of particular words in a document (sometimes called "bag of words" classifiers) are based on two empirical observations regarding text:

1. the more times a word occurs in a document, the more relevant it is to the topic of the document;
2. the more times the word occurs throughout the documents in the collection, the more poorly it discriminates between documents.

A well-known approach for computing word weights is term frequency-inverse document frequency (tf-idf) weighting [2], which assigns a weight to a word in a document in proportion to the number of occurrences of the word in the document and in inverse proportion to the number of documents in the collection in which the word occurs at least once, i.e.


N aik = f ik log   ni 

(1)

where a_{ik} is the weight of word i in document k, f_{ik} is the frequency of word i in document k, N is the number of documents in the collection, and n_i is the number of documents in which word i occurs at least once. A standard approach to text categorization makes use of the classical text representation technique [2] that maps a document to a high-dimensional feature vector, where each entry of the vector represents the presence or absence of a feature [3]. This approach loses all word order information, retaining only the frequency of the terms in the document. It is usually accompanied by the removal of non-informative words (stop words) and by the replacing of words by their stems, so losing inflection information. Such sparse vectors can then be used in conjunction with many learning algorithms [4]. There are a number of alternative strategies to the reliance on pure word comparison which can provide additional information for feature extraction. In particular, N-Grams (sequences of N characters) and sequences of contiguous and non-contiguous words (phrases) have also been used for classification [5], [6]. In this paper we describe a method to identify a set of features in the form of N-Gram patterns using Genetic Programming (GP) [7], with fitness based on the tf-idf principle. Although GP has been used in the area of text mining before [8], we are unaware of other implementations that have used GPs to evolve classifiers based on GP-created N-Gram patterns. In the next section, we review previous work with N-Grams and with phrases. We then provide information concerning the implementation of our application and the initial results we have obtained on a classification task.

1.1 N-Grams

A character N-Gram is an N-character slice of a longer string; for example, the word INFORM can be represented by the 5-grams _INFO, INFOR, NFORM, FORM_, where the underscore represents a blank. If we count N-Grams that are common to two strings, we get a measure of their similarity that is resistant to a wide variety of grammatical and typographical errors. N-Grams have proven a robust alternative to word stemming, with the further advantage of requiring no linguistic preparation [5], [9], [10]. A further useful property of N-Grams is that the lexicon obtained from the analysis of a text in terms of character N-Grams cannot grow larger than the size of the alphabet to the power of N. Furthermore, because most possible sequences of N characters rarely or never occur in practice for N>2, a table of the N-Grams occurring in a given text tends to be sparse, with the majority of possible N-Grams having a frequency of zero even for very large amounts of text. Tauritz [11] and later Langdon [12] used this property to build an adaptive information filtering system based on weighted trigram analysis in which genetic algorithms were used to determine weight vectors. An interesting modification of N-Grams is to generalise N-Grams to substrings which need not be contiguous [13].
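As an illustration of the character N-Gram representation described above, the following short Python sketch (ours, not from the paper) slices a padded word into its N-character slices; the function name and the use of '_' as the padding character are illustrative choices.

    def char_ngrams(word, n):
        """Return the character N-Grams of a word, padded with a leading and
        trailing blank (written here as '_'), as in the INFORM example."""
        padded = "_" + word.lower() + "_"
        if len(padded) < n:
            return []
        return [padded[i:i + n] for i in range(len(padded) - n + 1)]

    # The 5-grams of INFORM
    print(char_ngrams("INFORM", 5))   # ['_info', 'infor', 'nform', 'form_']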


1.2 Phrases

The notion of N-Grams of words (i.e. sequences of N words, with N typically equal to 2, 3, 4 or 5) has produced good results in language identification, speech analysis and several areas of knowledge extraction from text [14]. Pickens and Croft [6] make the distinction between 'adjacent phrases', where the phrase words must be adjacent, and Boolean phrases, where the phrase words may be present anywhere in the document. They found that adjacent phrases tended to be better than Boolean phrases in terms of retrieval relevance, but not in all cases. Restricting a search to adjacent phrases means that some retrieval information is lost.
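To make the distinction concrete, here is a small sketch of our own (not from the paper) contrasting a Boolean phrase match, which only requires the words to appear somewhere in the document, with an adjacent phrase match, which requires them to appear side by side in order.

    def boolean_phrase_match(words, doc_tokens):
        """True if every word of the phrase occurs anywhere in the document."""
        token_set = set(doc_tokens)
        return all(w in token_set for w in words)

    def adjacent_phrase_match(words, doc_tokens):
        """True if the phrase words occur consecutively, in order, in the document."""
        n = len(words)
        return any(doc_tokens[i:i + n] == list(words)
                   for i in range(len(doc_tokens) - n + 1))

    doc = "the space shuttle program was cancelled".split()
    print(boolean_phrase_match(("space", "program"), doc))    # True
    print(adjacent_phrase_match(("space", "program"), doc))   # False (not adjacent)
    print(adjacent_phrase_match(("space", "shuttle"), doc))   # True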

2 Implementation

Our approach aims to identify sets of both adjacent and non-adjacent sequences of N-Grams found in a text environment that may be useful as classifiers. We use the symbolic capability of GP to combine sequences with functions such as OR and NAND to provide a set of complex features which may be used as a subtle and sophisticated mechanism for discriminating between documents.

• The basic unit (or phrase unit) we use is an N-Gram (sequence of N characters).
• An N-Gram is said to match a word in a document if it is the same character string as the word or is a substring of the word. We do not include space characters in the N-Grams.
• Sequences of N-Grams can be produced by GPs and evaluate to true or false for a particular document, depending on whether a match for the sequence exists in the text of that document.
• Special symbols can be included in the specifications produced by the GPs; for example, a '+' symbol indicates that the two N-Grams which follow must be found in adjacent words in a document, i.e. an adjacent phrase in the terminology of Pickens and Croft [6].
• A GP produces N-Gram based schemas which evaluate to true or false (i.e. match or do not match) for any document, in the same way that we can say a particular word occurs or does not occur in a document.
• GPs can also combine schemas using Boolean functions.
• Classifiers must be evolved for each category c of the dataset. Fitness is then accrued by GPs producing Boolean N-Gram schemas which are true for training documents in c but are not true for documents outside c, in a tf-idf manner. Thus the documents in the training set represent the fitness cases.

The task involved categorising documents selected from the 20 Newsgroup dataset.

2.1 The 20 Newsgroup Dataset

The 20 Newsgroup corpus collected by Lang [15] contains about 20,000 documents evenly distributed among 20 UseNet discussion groups.


This natural language corpus is widely employed for evaluating text classification techniques [16], [17], [18]. Many of the groups have similar topics (e.g. five groups discuss different issues concerning computers), and many of the documents in the corpus appear in more than one group (since people tend to post articles to multiple newsgroups). The classification task is therefore typically hard and suffers from inherent noise when trying to estimate the relevant probabilities [18].

2.2 Pre-processing

Before the evolution of classifiers a number of pre-processing steps are made:

1. All the text in the document collection is placed in lower case.
2. Numbers are replaced by a special character and non-alphanumeric characters are replaced by a second special character.
3. All the documents in the training data are searched for unique N-Grams, which are then stored in sets for each size from N=2 to N=max_size. The size of these sets can be reduced by requiring that an N-Gram occur at least σ times (e.g. 2 or more instances of the N-Gram) or that it occur in at least σ documents.

Note that the use of N-Grams as features makes word stemming unnecessary, and that a stop list is not required because of the natural screening process provided by the fitness test (see below).

2.3 Fitness

GPs are set the task of assembling single letters into N-Gram strings and then forming a Boolean schema of N-Gram patterns. The schema is then evaluated against a set of documents from each category of the training set. An N-Gram matches a word in a document if it is the same as that word or is a substring of that word. Where a schema is simply a sequence of N-Grams, it is said to match a document if each element of the schema matches a word in the document and each subsequent element of the schema matches a word further along the document text. The function returns a Boolean value indicating whether the entire sequence has a match in a document. GPs may combine N-Gram sequences with Boolean functions. A library of highly fit Boolean schemas must be evolved for each category of the dataset. For a particular category the GPs are searching for patterns which occur in a number of documents within the category but occur infrequently or not at all in any other document in any other category. A simple fitness measure is summarized by the pseudo code below (where the lower the fitness, the better the individual):

    IF CountOfMatchingDocumentsInTopic = 0 THEN
        Fitness := High_Value
    ELSE
        Fitness := ((CountOfMatchingDocumentsNotInTopic * penalty) + occurrence)
                   / CountOfMatchingDocumentsInTopic
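The fitness measure above can be rendered as a short function. This is our own sketch: the default penalty and occurrence values follow the settings mentioned in the text (a penalty of about 20 and an occurrence value of around 3), while the high_value constant used for schemas with no in-topic matches is an illustrative placeholder.

    def schema_fitness(matches_in_topic, matches_not_in_topic,
                       penalty=20, occurrence=3, high_value=1000):
        """Lower fitness is better. Schemas matching no in-topic documents are
        heavily penalised; otherwise fitness rewards in-topic matches and
        penalises matches outside the topic, in a tf-idf-like manner."""
        if matches_in_topic == 0:
            return high_value
        return (matches_not_in_topic * penalty + occurrence) / matches_in_topic

    # A schema matching 10 in-topic documents and 1 out-of-topic document:
    print(schema_fitness(10, 1))   # 2.3 (not yet below the library threshold of 1)
    print(schema_fitness(10, 0))   # 0.3 (fit enough to be stored in the library)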


We generally used a high penalty (e.g. 20) for matching documents not in the topic category and an occurrence value of around 3, requiring that the N-Gram pattern occur in more than 3 documents to achieve a fitness value below 1. It may also be useful to penalise GPs producing schemas that include N-Grams of size 1, as a method of reducing redundant entries in a library.

2.4 GP Types

We use a strongly typed, tree-based GP system with the types shown in Table 1.

Table 1. GP Types

N-Gram: a sequence of one or more characters.
Special Symbol: a symbol used as part of an N-Gram sequence. For example, '+' indicates that the following two N-Grams in an N-Gram sequence must be substrings of words which occur together in a document text with no words between them.
N-Gram Sequence: a sequence of one or more N-Grams and special symbols.
Boolean: AND, NOT used with N-Gram sequences as arguments.
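To illustrate how an N-Gram sequence containing the '+' special symbol might be evaluated against a document, here is a small interpreter sketch of our own; the representation of a sequence as a Python list, the greedy left-to-right matching, and the function names are illustrative assumptions, not the authors' implementation.

    def ngram_matches_word(ngram, word):
        """An N-Gram matches a word if it equals the word or is a substring of it."""
        return ngram in word

    def sequence_matches(seq, doc_tokens):
        """Evaluate an N-Gram sequence against a tokenised document (simplified,
        greedy matcher). Elements must match words in order, each further along
        the text; a '+' element forces the next two N-Grams to match adjacent words."""
        pos = 0
        i = 0
        while i < len(seq):
            if seq[i] == '+':
                # the two N-Grams following '+' must match two adjacent words
                g1, g2 = seq[i + 1], seq[i + 2]
                for j in range(pos, len(doc_tokens) - 1):
                    if ngram_matches_word(g1, doc_tokens[j]) and \
                       ngram_matches_word(g2, doc_tokens[j + 1]):
                        pos = j + 2
                        break
                else:
                    return False
                i += 3
            else:
                for j in range(pos, len(doc_tokens)):
                    if ngram_matches_word(seq[i], doc_tokens[j]):
                        pos = j + 1
                        break
                else:
                    return False
                i += 1
        return True

    doc = "nasa plans a new space shuttle mission".split()
    print(sequence_matches(['spac', 'miss'], doc))        # True: 'spac' then 'miss' further along
    print(sequence_matches(['+', 'spac', 'shut'], doc))   # True: adjacent words
    print(sequence_matches(['+', 'spac', 'miss'], doc))   # False: not adjacent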

Each GP will output a Boolean N-Gram schema. Any schema output by a GP can be evaluated against a document, and the document can be said to match or not match the schema.

2.5 GP Terminals

• the 26 lower case alphabetic characters (a-z)
• '#', meaning any number
• '~', meaning any non-alphanumeric character

2.6 GP Functions

The GPs are provided with a variety of protected string-handling functions for combining characters into N-Gram strings, concatenating N-Grams, and adding N-Grams to a sequence of N-Grams. Most combinations of letters above an N-Gram size of 2 are unlikely to occur in any text; for example, where the N-Gram size is 6, only 0.001% of possible N-Grams are found in the 20 Newsgroup data. We therefore guide the GPs through the vast search space of possible N-Gram patterns by providing a special 'Expand' function, which forms a new N-Gram by combining two N-Grams (N-Gram1 and N-Gram2). A new N-Gram, NewN-Gram, is formed by appending N-Gram2 to N-Gram1. If NewN-Gram is of length L, the Expand function checks whether NewN-Gram is in the set of N-Grams of size L originally extracted from all the text in the training documents. If it is found in this set, NewN-Gram is returned; if not, the nearest occurring N-Gram in the set is returned. If the set is empty, the original N-Gram N-Gram1 is returned. Note that the 'Expand' function will never return an N-Gram consisting of a sequence of characters which never occurs in the training data. The function is summarised by the following pseudo code:

    SetOfL-Grams = the ordered set of N-Grams of length L found in documents
                   from the training set

    Expand (N-Gram1, N-Gram2)
        NewN-Gram = (concatenate N-Gram1, N-Gram2)
        L = (sizeOf NewN-Gram)
        IF (IsEmpty SetOfL-Grams) RETURN N-Gram1
        ELSE IF (IsMember NewN-Gram SetOfL-Grams) RETURN NewN-Gram
        ELSE RETURN Nearest N-Gram in SetOfL-Grams

The size of the N-Gram sets can be reduced by extracting them only from the text of documents belonging to the category for which the library is being evolved. This will greatly reduce the search space of the GPs, but some discriminating ability will be lost where a Boolean NOT function is included in the GP function set.
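A minimal Python rendering of the Expand function follows. It is our own sketch: the "nearest" N-Gram is found here by combining a binary search over the sorted set of known L-grams with a crude per-position difference count, which is one plausible reading of "nearest occurring N-Gram" rather than a detail given in the paper.

    import bisect

    def expand(ngram1, ngram2, ngram_sets):
        """Combine two N-Grams, constrained to N-Grams seen in the training data.

        ngram_sets maps a length L to the sorted list of all L-grams extracted
        from the training documents."""
        new_ngram = ngram1 + ngram2
        l_grams = ngram_sets.get(len(new_ngram), [])
        if not l_grams:
            return ngram1                    # no L-grams of this size: keep N-Gram1
        i = bisect.bisect_left(l_grams, new_ngram)
        if i < len(l_grams) and l_grams[i] == new_ngram:
            return new_ngram                 # the combination occurs in the training data
        # otherwise return the nearest occurring L-gram (our interpretation)
        candidates = [l_grams[j] for j in (i - 1, i) if 0 <= j < len(l_grams)]
        return min(candidates, key=lambda g: char_distance(g, new_ngram))

    def char_distance(a, b):
        """Crude 'distance' between two equal-length strings: differing positions."""
        return sum(c1 != c2 for c1, c2 in zip(a, b))

    # Example with a tiny hypothetical set of 4-grams from the training data
    sets = {4: sorted(["info", "form", "spac", "shut"])}
    print(expand("in", "fo", sets))   # 'info' (exact match in the set)
    print(expand("in", "fa", sets))   # nearest 4-gram from the set, here 'info'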

2.7 Measures

We use the following performance measures to test the effectiveness of the evolved classifiers:

p = a / (a + b)    (2)

r = a / (a + c)    (3)

where:

• a = the number of documents correctly assigned to the category
• b = the number of documents incorrectly assigned to the category
• c = the number of documents incorrectly rejected from the category
• p = precision
• r = recall

For evaluation we used the F1 performance measure given by

F1(p, r) = 2pr / (p + r)    (4)

Note that the F1 measure gives equal weighting to both precision and recall [19].
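As a quick worked example of these measures (our own illustration), suppose a classifier assigns 70 documents to a category, of which 60 are correct, while 40 documents that belong to the category are missed:

    def f1_score(a, b, c):
        """Precision, recall and F1 from correct assignments (a), incorrect
        assignments (b) and incorrect rejections (c)."""
        p = a / (a + b)
        r = a / (a + c)
        return p, r, 2 * p * r / (p + r)

    print(f1_score(a=60, b=10, c=40))   # (0.857..., 0.6, 0.705...)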


2.8 Library Building

When a frequent pattern is found to reach a preset minimum fitness (e.g. less than 1), the regular expression is stored in a library. Once a pattern is stored in the library, any GP producing that pattern will accrue a fitness penalty (normally +1). The purpose of this mechanism is to keep the GP population continuously searching for new patterns, whilst leaving the GP that produced the discovered pattern a good chance of remaining in the population and thus potentially contributing to the evolution of new useful patterns through genetic operators. The graph in Fig. 1 shows a typical pattern of evolution.

Fig. 1. Evolution of a classifier (population 800, 40 generations, 4 runs); the plot shows the best-of-generation fitness and the library size/100 against generations for each run.

Once the GP population has built a library for a particular topic the library can be used for document categorisation purposes. We suggest that a document from the test set which matches patterns stored in the library for a particular category is more likely to be in that category.
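Putting the pieces together, classification of an unseen document can be sketched as follows (our own illustration; schema_matches stands for whichever schema evaluation routine is in use, such as the sequence matcher sketched earlier, and the function names are assumptions):

    def classify(doc_tokens, libraries, schema_matches):
        """Assign a document to the category whose evolved library contains
        the largest number of schemas matching the document.

        libraries maps a category name to its list of evolved Boolean schemas;
        schema_matches(schema, doc_tokens) returns True if the schema matches."""
        scores = {
            category: sum(schema_matches(s, doc_tokens) for s in schemas)
            for category, schemas in libraries.items()
        }
        return max(scores, key=scores.get)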

3 Experiment

We used a strongly typed, tree-based GP system with a population of 800 GPs. The system was run for 4 GP runs of 40 generations for each category. We used 10 categories from the 20 Newsgroup corpus and selected 200 documents per category for training purposes. A library of schemas was evolved for each of the 10 categories, with an average size of approximately 300 schemas. Every document from the test set was then assigned to a category based on the number of schemas from each of the libraries matching that document. Although this is only a subset of the entire corpus (10 out of 20 categories) and cross-method comparison is often problematic [20], the results in Table 2 would seem to compare favourably with other methods used to categorise this data set [18]. We intend to perform an in-depth comparison with other methods using the complete data set, to identify exactly where the GP system outperforms other methods and where it may be used to complement other methods of categorization.

Table 2. Results

Category Name             F1      Precision   Recall
Sci.space                 .949    .974        .926
Sci.crypt                 .908    .872        .947
Sci.med                   .838    .815        .862
Soc.religion.christian    .841    .860        .823
Alt.atheism               .729    .679        .788
Rec.sport.baseball        .679    .650        .710
Rec.sport.hockey          .751    .787        .719
Rec.motorcycles           .808    .720        .920
Talk.politics.guns        .816    .890        .754
Talk.politics.mideast     .730    .730        .730

4 Future Work

The regular occurrence of synonyms (different words with the same meaning) and homonyms (words with the same spelling but with distinct meanings) is a key problem in the analysis of text data. We are currently trying to assess the usefulness of the Boolean functions OR, AND and NOT for the task of discriminating between texts. We suggest that the OR function may be able to detect synonyms, and that the AND and NAND functions may be able to detect homonyms by identifying associated words in two N-Gram sequences. We are currently developing the system with the intention of analyzing the use of synonyms and homonyms. We are also investigating the usefulness of new GP functions that insert regular expression symbols within the N-Grams, for example a '*' character to mean any sequence of letters. Lastly, we are developing a modified fitness test based on the F1 measure.

5 Conclusion

We have produced a system capable of discovering patterns of N-Gram co-occurrence in text and storing specifications for those patterns. We have been able to perform a document categorisation task using these specifications, and we believe that there may be many other areas within automatic text analysis where the basic technology described here may be of use.

References

1. Bennet, K., Shawe-Taylor, J., Wu, D.: Enlarging the margins in perceptron decision trees. Machine Learning 41 (2000) 295-313
2. Salton, G., McGill, M.J.: An Introduction to Modern Information Retrieval. McGraw-Hill (1983)
3. Joachims, T.: Text categorization with support vector machines: learning with many relevant features. In: Proceedings of the 10th European Conference on Machine Learning (ECML-98) (1998) 137-142
4. Sebastiani, F.: Machine learning in automated text categorization. ACM Computing Surveys 34(1) (2002) 1-47
5. Cavnar, W., Trenkle, J.: N-Gram-Based Text Categorization. In: Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval (1994)
6. Pickens, J., Croft, W.B.: An Exploratory Analysis of Phrases in Text Retrieval. In: Proceedings of the RIAO 2000 Conference, Paris (2000)
7. Koza, J.R.: Genetic Programming: On the Programming of Computers by Means of Natural Selection. The MIT Press, Cambridge MA (1992)
8. Bergström, A., Jaksetic, P., Nordin, P.: Enhancing Information Retrieval by Automatic Acquisition of Textual Relations Using Genetic Programming. In: Proceedings of the 2000 International Conference on Intelligent User Interfaces (IUI-00). ACM Press (2000) 29-32
9. Damashek, M.: Gauging similarity with n-grams: Language-independent categorization of text. Science 267 (1995) 843-848
10. Biskri, I., Delisle, S.: Text Classification and Multilinguism: Getting at Words via N-grams of Characters. In: Proceedings of the 6th World Multiconference on Systemics, Cybernetics and Informatics (SCI-2002), Orlando, Florida, USA, Volume V (2002) 110-115
11. Tauritz, D.R., Kok, J.N., Sprinkhuizen-Kuyper, I.G.: Adaptive information filtering using evolutionary computation. Information Sciences 122(2-4) (2000) 121-140
12. Langdon, W.B.: Natural Language Text Classification and Filtering with Trigrams and Evolutionary Classifiers. In: Whitley, D. (ed.): Late Breaking Papers at the 2000 Genetic and Evolutionary Computation Conference, Las Vegas, Nevada, USA (2000) 210-217
13. Lodhi, H., Shawe-Taylor, J., Cristianini, N., Watkins, C.: Text classification using string kernels. In: Leen, T.K., Dietterich, T.G., Tresp, V. (eds.): Advances in Neural Information Processing Systems 13. MIT Press (2001) 563-569
14. Ahonen-Myka, H.: Finding All Maximal Frequent Sequences in Text. In: Proceedings of the 16th International Conference on Machine Learning (ICML-99) (1999)
15. Lang, K.: Learning to filter netnews. In: Proceedings of the 12th International Conference on Machine Learning (1995) 331-339
16. Schapire, R., Singer, Y.: BoosTexter: A boosting-based system for text categorization. Machine Learning 39 (2000)
17. Slonim, N., Tishby, N.: Agglomerative Information Bottleneck. In: Proceedings of Neural Information Processing Systems (NIPS-99) (1999) 617-623
18. Slonim, N., Tishby, N.: The Power of Word Clusters for Text Classification. In: 23rd European Colloquium on Information Retrieval Research (2001)
19. Van Rijsbergen, C.J.: Information Retrieval, 2nd edition. Department of Computer Science, University of Glasgow (1979)
20. Yang, Y., Liu, X.: A re-examination of text categorization methods. In: Proceedings of the 22nd Annual ACM SIGIR Conference on Research and Development in Information Retrieval (1999) 42-49
