Choosing feature sets for training and testing self-organising maps: A Case Study Khurshid Ahmad, Bogdan L. Vrusias and Anthony Ledford* AI Group, Department of Computing, School of EEITM, University of Surrey, Guildford, UK *Department of Mathematics and Statistics, University of Surrey, Guildford, UK
Correspondence and offprint requests to: Prof. K. Ahmad, Artificial Intelligence Group, Department of Computing, School of Electronic Engineering, Information Technology and Mathematics, University of Surrey, Guildford GU2 5XH, UK
Abstract
Statistical pattern recognition techniques, supervised and unsupervised classification techniques being two good examples here, rely on the computation of similarity and distance metrics. The distances are computed in a multi-dimensional space. The axes of this space in principle relate to the features inherent in the input data. Usually such features are chosen by neural network developers, thereby introducing a possible bias. A method of automatically generating feature sets is discussed, with specific reference to the categorisation of streams of free-text news items. The feature sets were generated by a procedure that automatically selects a group of keywords based on a lexico-semantic analysis. Three different types of text streams – headlines only, news summaries and full news items including the body of the text – have been categorised using self-organising feature maps (SOFM) [2]. A method for assessing the discrimination ability of a SOFM, based on Fisher's Linear Discriminant Rule, suggests that a SOFM trained on vectors related to summaries only provides fairly accurate clusters when compared with one trained on vectors related to full text. The use of summaries as document surrogates for document categorisation is suggested.
Keywords: SOFM, Kohonen Map, Text Classification, Training NN, Automatic Classification, Linear Discriminant Rule, Weirdness Coefficient.
1 Introduction
Researchers in connectionism regard natural language phenomena, by which they generally mean natural language understanding, with an equal mixture of awe and surprise. For Jeffrey Elman, natural language is 'one of the most fruitful – and contentious – areas of neural network research' [1]. Teuvo Kohonen observes that 'it may sound surprising that vector-space methods such as SOFM can handle structured, discrete, linguistic data' (1995:247). Indeed, the understanding of continuous samples of linguistic expressions, either as long speech fragments or free-text that naturally occurs in books, journals and news reports, is still a much-coveted goal in cognitive and brain sciences in general, and in artificial intelligence in particular.
There have been many practical applications in connectionist speech processing, especially speech analysis and recognition (see Kohonen [2]). There is some evidence from work in linguistics that word categories (nouns, verbs, adjectives, adverbs, prepositions, etc.) may be inferred from the statistical occurrences of words in different contexts. For Kohonen and his colleagues, '"context patterns" consist of groups of contiguous symbols'; the authors cite pairs or triplets of words in a sentence as an example of such patterns. Such pairs or triplets are then used as inputs in the training and testing of self-organising feature maps (SOFMs), with the resulting SOFM regarded as a contextual map. Kohonen [2] has shown that a SOFM trained on word-context pairs, derived from 10,000 random sentences, shows 'a meaningful geometric order of the various word categories' (p.207). More recently, the categorisation power of neural networks, particularly those based on competitive learning laws, has been demonstrated by WEBSOM. WEBSOM has been variously described by Kohonen as a scheme, content-addressable memory, method and architecture. WEBSOM is a two-level self-organising feature map comprising a word category map and a document category map, which has been used to classify newsgroup discussions, full-text data and articles in scientific journals (Kohonen [3], Kaski et al. [4]).
The news report is one of the most commonly occurring linguistic expressions. Despite being a good example of open-world data, a news report is a contrived artefact: each report has a potentially attention-grabbing headline; the opening few sentences generally comprise a good summary of the contents of the report; there are slots for the date of origin and slots for photographs and other graphic material. This contrived artefact is highly focused and highly perishable, and usually contains references to one or more persons, places, events or actions. Given the growth of telecommunications networks, it is possible to have almost non-stop news streaming in. The reports in the news streams have to be categorised in order that they may, for example, be routed to those who declare an interest in a set of (related) categories. Automatic categorisation of news stories is of substantial interest to defence/intelligence agencies (Mani [5]), to information retrieval communities, and to major news vendors supplying on-line news.
This paper outlines a method for generating feature sets for texts that are then classified using Kohonen’s self-organising feature maps. The 100 texts were drawn from a test-sample which was created for evaluating the effectiveness of text categorisation systems by the US government’s Defence Advanced Research Projects Agency for its TIPSTER Text Programme in 1997. The SOFMs were able to categorise the texts, full news reports and summaries thereof, in 10 different classes without a priori knowledge of the categories.
2 Text Categorisation Systems
There are at least two groups of researchers that are using neural networks for text categorisation. The first group, which includes David Lewis and colleagues, is using supervised learning algorithms for training neural networks, particularly Widrow-Hoff's error-correction learning paradigm and its recent variant, the exponentiated gradient method, due to Kivinen and Warmuth [6]. Lewis [7] used the TREC data set, containing Associated Press (AP) newswire texts and a collection of medical abstracts, to compare the Widrow-Hoff and exponentiated-gradient methods with a conventional information retrieval technique (the Rocchio learning algorithm). Lewis has shown that both neural network algorithms outperform Rocchio's classifiers for a whole range of documents: medical abstracts, news headlines and reports, and abstract titles. Lewis also showed that the F-measure, a weighted combination of precision and recall, was consistently better for the full body of the text as compared to the headline.
The second group uses self-organising feature maps (SOFMs), particularly Kohonen feature maps, to execute a range of linguistic tasks including text categorisation. Kaski and colleagues have attempted to create 'an order in digital libraries with self-organising maps' [4]. They have used a two-level self-organising map comprising two hierarchically interrelated SOFMs: the first is a word category map, which describes relations between words based on their averaged short contexts; the second is a document map. A histogram of hits is produced by mapping a document onto the word category map, and this histogram is then used as the document's 'fingerprint' to position it on the document map. A good degree of success has been reported with low-content news streams, like newsgroups, with the result that the streams could be browsed much more easily. WEBSOM has been used to cluster a range of text materials, including full-text documents [8], abstracts of scientific articles, patent abstracts and news items [9], with some degree of success.
Classification of news reports poses a significant challenge in that such reports deal with a variety of interlinked specialist topics, for example, terrorism, financial instruments (such as currencies, bonds and derivatives), ecology, politics, drug enforcement, trade and industry. Some reports may be focused on events, whilst others deal with people and places. Some information in a news wire may be text internal, like values of key indicators, casualty figures, contraband types and quantities, but other information may require external reference to previous events, current and past celebrities, etc. Essentially, there may be keywords (e.g., drug cartels, eurodollar, global warming) in the news reports which, in turn, may be used to categorise such reports. However, these keywords can change (e.g., eurodollar being superseded by Euro). What is required here is a dynamic classification scheme that is flexible enough to adapt to novel topics which might change rapidly, but is also able to deal with related keywords that might last for months on end; for example, Clinton's misdemeanour (and) impeachment might last for several months but earthquake disasters may be news for only a few days. Nevertheless, sets of news reports delivered by a news wire service may yet be categorised by common keywords and proper nouns of people, places and organisations.
One well-recognised way of describing news reports is to classify the texts as a distinct register or genre of writing. The term register is used to indicate that the language within a specialised field differs from that of general language or language of everyday use, at lexical, syntactic and semantic levels. A large collection of general language text may thus be contrasted with a set of specialist reports at various linguistic levels, including lexical and semantic.
We describe below a method of semi-automatically identifying the terms of a set of specialist domains. This method involves comparing the frequency of systematic terms in a collection of specialist texts sometimes called a corpus, with the frequency (or absence) of the terms in a carefully compiled corpus of general language texts. Each term can be construed as a dimension in a vector space and the presence or absence of a term within a text is then used to allocate the text its position within the vector space.
3 Generating Feature Sets for Text Categorisation
Consider a set of texts that may have been selected according to certain criteria: for instance, all texts streaming along a news wire over a short period of time comprising news related to specialist topics – like environmental news or economic news. Such a set is usually called an environmental text corpus or economic text corpus. If a text corpus deals with a specific topic then one will find that terms related to the topic occur quite frequently. However, if one collects texts streaming over a long period of time, irrespective of topics, then there is little likelihood of finding very high frequency terms associated with only one topic.
It is a well documented fact that, whether a text is specialist or general language, the so-called closed-class words, like the determiners ('a', 'an' and 'the' in English), prepositions, conjunctions and certain verbs, may comprise as much as 25% of the given text. The relative frequency of closed-class words (also called grammatical words or stop-list words) in specialist and general language texts is usually the same; it is the nominals (most terms are either single nominals or compound ones) that occur with different relative frequencies. (The relative frequency of a word is the ratio of its absolute frequency to the total number of words in the text or corpus.) A second ratio, that of the relative frequency of the same term in a specialist text to that in general language texts, can be used to quantify the lexical difference between the two texts. This ratio has been termed weirdness to indicate how it measures the preponderance of words in specialist texts that would be unusual in general language (Ahmad [11]).
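The weirdness ratio described above can be sketched in a few lines of Python. This is an illustrative implementation of the ratio, not the System Quirk code; the corpora here are toy examples:

```python
from collections import Counter

def weirdness(term, specialist_tokens, general_tokens):
    """Ratio of a term's relative frequency in a specialist corpus
    to its relative frequency in a general-language corpus."""
    spec = Counter(specialist_tokens)
    gen = Counter(general_tokens)
    rel_spec = spec[term] / len(specialist_tokens)
    # A term absent from the general corpus is maximally 'weird'.
    if gen[term] == 0:
        return float("inf")
    rel_gen = gen[term] / len(general_tokens)
    return rel_spec / rel_gen

specialist = "the pollution levels rose as pollution controls failed".split()
general = "the cat sat on the mat and the dog barked".split()
print(weirdness("pollution", specialist, general))  # inf: absent from general corpus
print(weirdness("the", specialist, general))        # below 1: commoner in general language
```

Closed-class words like 'the' score near or below 1, while domain terms score high, which is what makes the ratio usable as a filter.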
3.1 Training and Test Data
Our text corpus consisted of 100 Associated Press (AP) news wires selected from 10 pre-classified news categories. The average length of the articles was 622 words. These articles were summarised to compare the effect of summarisation on the ability of the system to categorise and the summariser to retain information important to the main theme of the article (for more information on our automatic summarisation technique see [12]). After summarisation the average length was reduced to 236 words. Finally, from each article we also separated out the news headline, which had an average length of 10 words.
This AP news corpus had previously formed a subset of the documents used to assess the accuracy of automatic summarising procedures [5]. As part of the testing procedure, a group of human assessors marked each of the full-texts and summaries for relevance to the predefined category, as a measure of summary effectiveness. Unknown to both the summarisers and assessors, a few irrelevant texts had been included in each category. The 'relevance' as judged by the assessors for the full-texts was of particular interest to us. In cases where an assessor thought a text was relevant but the 'true' classification was irrelevant (False Positive), one should not be surprised if the system clusters the text with others within the same category. However, when an assessor considered a text to be irrelevant in agreement with the 'true' classification (True Negative), one should expect the text not to cluster. How this categorisation affected our system will be discussed later.
Typically, before text documents are represented as vectors in order to act as the input to a text categorisation system, pre-processing takes the form of filters to remove words 'low in content' from the text (see the WEBSOM method [9]). We remove punctuation, numerical expressions and closed-class words as a precursor to generating the feature set. Vectors representing news texts were created on the basis of a lexical profile of the training set of texts. This lexical profile was determined by two measures: the frequency of a term; and a weirdness coefficient describing the subject-specificity of a term, where:
weirdness coefficient = (relative frequency of term in specialist corpus) / (relative frequency of term in general language corpus)    (1)
The feature set was created by first selecting the top 5% (10% for headlines) most frequently occurring words and, from this set, choosing the words with the highest weirdness coefficient. From these, the 50 most frequent words are selected; spelling mistakes and numerical expressions are excluded, and terms too infrequent to provide consistency within a domain are avoided. A high value for the weirdness coefficient is indicative of a word which is uncommon in general language but common in the specialist corpus under examination, and is thus a good candidate for a domain term or other word specific to that genre. By disregarding words with a weirdness coefficient lower than a threshold, many closed-class words and other terms common in general language are automatically removed.
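The two-stage selection described above can be sketched as follows. The function name, thresholds and the assumption that weirdness scores have been precomputed (e.g., via equation 1) are ours; the exact System Quirk settings may differ:

```python
from collections import Counter

def select_features(tokens, weirdness_scores, top_frac=0.05,
                    weirdness_threshold=1.0, n_features=50):
    """Two-stage selection sketch: keep the top fraction of words by raw
    frequency, discard those at or below a weirdness threshold, and
    return the n_features most frequent survivors."""
    freq = Counter(tokens)
    ranked = [w for w, _ in freq.most_common()]
    # Stage 1: top fraction of the vocabulary by frequency.
    top = ranked[:max(1, int(len(ranked) * top_frac))]
    # Stage 2: keep only sufficiently 'weird' words.
    weird = [w for w in top if weirdness_scores.get(w, 0.0) > weirdness_threshold]
    # Return the most frequent survivors as the feature set.
    weird.sort(key=lambda w: freq[w], reverse=True)
    return weird[:n_features]
```

A stop-list is unnecessary here: frequent closed-class words pass stage 1 but fail the weirdness filter in stage 2.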
The 100 AP news wires comprised over 56,000 words. System Quirk [11], a text analysis system developed at the University of Surrey, was used to compute the frequency distribution of words in the AP news wire corpus. The System also has access to the frequency distribution of words in the British National Corpus [13], a carefully compiled general language corpus. Some of the high-weirdness terms, e.g., drug, taxes, pollution and environmental, are important keywords, but the same cannot be said for 'terms' like billion, percent and federal. Usually, proper nouns are also flagged as terms by this method.
A separate feature set was generated for all full texts, all summaries and all headlines, using the feature words extracted from each separate corpus. For example, in generating feature words for the headlines the weirdness coefficients were calculated across the collection of headlines only.
The feature words identified for each of the three text types are shown in Table 1. The full texts and summaries share many common feature words, giving us confidence that the summaries preserve the content of the news, in particular the main domain terms. Headlines, however, share roughly a third fewer feature words with the full texts. This hints, even at an early stage, that the content of the headlines differs from that of the other two forms of news item.
Having identified the feature set, the training vectors for each of the texts could then be generated. Each vector consisted of binary values indicating the presence or absence of each of the feature words determined above, that is,

y(x) = 1 if x > 0; y(x) = 0 if x = 0    (2)
This resulted in three sets of 100 vectors – one vector per article under each scenario – encoding our knowledge of the genre of each of the texts.
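The binary encoding of equation (2) amounts to a presence test over a fixed feature-word list. A minimal sketch, with an illustrative feature set:

```python
def encode(text_tokens, feature_words):
    """Binary presence vector over a fixed feature-word list (equation 2)."""
    present = set(text_tokens)
    return [1 if w in present else 0 for w in feature_words]

features = ["percent", "tax", "billion", "drug"]   # illustrative feature set
print(encode("the drug seizure cost a billion".split(), features))  # [0, 0, 1, 1]
```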
Consider one of the full AP-News stories about terrorism, together with its headline. The feature vector for the full text is:

  percent  tax  billion  drug  reagan  cars  taxes  environmental  pollution  …
     1      0      1      1      1      0      0         1             0      …

and for the headline it is:

  tax  scientists  reagan  bill  energy  drug  ap  nuclear  administration  …
   0       0         0      0      0      1    0      0           0         …
3.2 Training Kohonen Feature Maps
We have developed a system for creating Kohonen feature maps (SANC: Surrey Artificial Network Classifier). The algorithms used in training have been reported extensively in the literature (see, for example, Kohonen [2]). The system, after having trained an SOFM, is also capable of testing it. There are facilities to vary the key parameters associated with the Kohonen map, including the learning rate (α), the neighbourhood size (γ) and the decay factor (δ). There is a graphical user interface that can be used to plot the winners and associated weights in the output layer. Furthermore, the system allows the storage of previously trained maps for reference purposes. Figure 1 shows the interface of our system, which displays a trained map.
Figure 1. SOFM system developed at the University of Surrey. The numbers on the individual neurons in the output layer refer to the class of the input vector; the user can also use a colour scheme to distinguish between various classes.
The SANC system uses n-dimensional vectors as inputs and produces a visual two-dimensional output map, with details of each input's class and position on the map. The system can be configured to perform many kinds of training with different settings each time. Initially the settings take the most commonly used training values as defaults, but the user can change them as he or she wishes (Figure 2).
Figure 2. Parameter settings in SANC.
3.2.1 The Training Regimen
The two-dimensional Kohonen feature map used in our experiments employed the self-organising feature-mapping algorithm to organise news texts on a 15 x 15 map. The size of the input layer was determined according to the number of feature words extracted under each scenario.
SOFMs learn through the so-called winner-takes-all competition: the weights of the winning neuron j, which responds to the input i, denoted as wij(t), are changed in the t+1 training cycle according to:

wij(t+1) = wij(t) + α(t) · γ(t) · (xi − wij(t))

where α(t) denotes the learning rate, and γ(t) is related to the size of the neighbourhood of the winning neuron j and is given as:

γ(t) = exp( δ(t) · (rij / σ(t))² )

where δ(t) is the decay factor, usually given a fixed value of −0.5 (although sometimes other decay factors are used); rij is the distance between the current node and the best-match node; σ(t) is the so-called neighbourhood value, with σmax usually N/2; and N is the total number of neurons in the output layer.

More precisely, rij is the distance between the best-match (winner) node in the output layer found in the previous training cycle, say t−1, and the current node in the output layer, which may or may not be that same best-match node. If the current and previous nodes are the same then rij = 0 and the weight change is maximal: the node that won in both cycles receives the largest change of weight.
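One winner-takes-all update as described above can be sketched as follows. This is an illustrative reimplementation (the function name and grid layout are our own), not the SANC code; with δ = −0.5 the neighbourhood function reduces to the familiar Gaussian exp(−r²/2σ²):

```python
import numpy as np

def som_step(weights, x, alpha, sigma, delta=-0.5):
    """One winner-takes-all update on an (N, N, d) weight grid.
    alpha and sigma are the current learning rate and neighbourhood value."""
    # Winner: the node whose weight vector is closest to the input (Euclidean).
    dists = np.linalg.norm(weights - x, axis=2)
    wi, wj = np.unravel_index(np.argmin(dists), dists.shape)
    # Grid distance r from every node to the winner.
    ii, jj = np.indices(dists.shape)
    r = np.sqrt((ii - wi) ** 2 + (jj - wj) ** 2)
    # Neighbourhood function: gamma(t) = exp(delta * (r / sigma)**2).
    gamma = np.exp(delta * (r / sigma) ** 2)
    # Weight update: w(t+1) = w(t) + alpha * gamma * (x - w(t)).
    weights += alpha * gamma[..., None] * (x - weights)
    return weights, (int(wi), int(wj))
```

The winner itself (r = 0, γ = 1) moves furthest towards the input; nodes far from it barely move.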
SOFM literature suggests that the learning rate (α) and the neighbourhood value (σ) may be changed at every training cycle (t) to make the learning more efficient. Typically, the values of these parameters are kept constant or decremented exponentially or linearly. The first possibility is that training begins with large values of α and σ, respectively denoted by αmax and σmax, and during tmax training cycles these values are reduced to pre-selected minimum values αmin and σmin:

α(t) = αmax (αmin / αmax)^(t / tmax)

and

σ(t) = σmax (σmin / σmax)^(t / tmax).

The second way the parameters may be changed is to decrement them linearly:

α(t) = αmax (1 − t / tmax)
σ(t) = σmax (1 − t / tmax)

A simpler scheme is to change the value of only one of the parameters: the learning rate is kept constant whilst the neighbourhood value is reduced, or vice versa. In essence, there are six interesting permutations of the above-mentioned heuristics that can be used to effect a change in the learning rate and the neighbourhood value.
The output activation of the neuron j can be computed using a distance metric dj. Kohonen suggests using either the so-called Euclidean distance or the so-called dot product. We used both, and the categorisation produced using the Euclidean distance was very similar to that produced using the dot product.

3.2.2 Parameters and decay schemes in training Kohonen maps
We have tried six different permutations of the learning rate (α) and the neighbourhood parameter (σ).
Figure 3a: Decay Scheme 1
Figure 3b: Decay Scheme 2
Figure 3c: Decay Scheme 3
Figure 3d: Decay Scheme 4
Figure 3e: Decay Scheme 5
Figure 3f: Decay Scheme 6
Figures 3(a)-(f) show the effect of each of the permutations respectively. The maps created after training relate to the use of vectors derived from full text in the categorisation of the 100 newswires. Feature maps track according to 10 categories, each having its own icon.

3.2.3 Applying the Discriminant Rule on the Output Maps
The Kohonen feature maps (Figures 3a-f) show how the various news streams were categorised. There is a discernible difference between the maps: visual inspection indicates how effective the maps are in categorising different news stories. Once we have established an effective training scheme (neighbourhood and learning rate changing exponentially), we then compare how the Kohonen SOFM categorises full text, summaries and headlines.
This section details a variation of Fisher's Linear Discriminant Rule [14] that aims to account for the effect of different classes lying on the same coordinate point on the map. The fundamental idea is to quantify the distribution of the ensemble of groups relative to a metric that is invariant to location, scale, rotation and reflection transformations. This is achieved by calculating:

Q = (d′ B d) / (d′ W d)

where d is the eigenvector corresponding to the largest eigenvalue of W⁻¹B; B is the between-class covariance matrix, given by

B = Σᵢ₌₁ᵐ nᵢ* (X̄ᵢ. − X̄..)(X̄ᵢ. − X̄..)′

and W is the within-class covariance matrix, given by

W = Σᵢ₌₁ᵐ Σⱼ₌₁^{nᵢ} (Xᵢⱼ − X̄ᵢ.)(Xᵢⱼ − X̄ᵢ.)′

where X̄ᵢ. is the mean of class i, given by X̄ᵢ. = nᵢ⁻¹ Σⱼ₌₁^{nᵢ} Xᵢⱼ; X̄.. is the grand mean vector, given by X̄.. = (Σᵢ₌₁ᵐ Σⱼ₌₁^{nᵢ} Xᵢⱼ) / (Σᵢ₌₁ᵐ nᵢ*); and nᵢ* denotes the number of coordinates associated with class i that are not associated with any other class.
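A sketch of the Q computation, assuming numpy. Note one simplification: the text weights B by nᵢ*, the number of coordinates unique to class i, whereas this sketch uses the plain class size nᵢ; the function name and toy data are ours:

```python
import numpy as np

def discriminant_q(groups):
    """Fisher-style discriminant ratio Q = d'Bd / d'Wd for a list of
    (n_i, 2) arrays of map coordinates, one array per class.
    d is the eigenvector of W^{-1}B with the largest eigenvalue."""
    means = [g.mean(axis=0) for g in groups]
    grand = np.vstack(groups).mean(axis=0)
    # Between-class scatter (weighted by class size, a simplification of n_i*).
    B = sum(len(g) * np.outer(m - grand, m - grand) for g, m in zip(groups, means))
    # Within-class scatter.
    W = sum((g - m).T @ (g - m) for g, m in zip(groups, means))
    vals, vecs = np.linalg.eig(np.linalg.inv(W) @ B)
    d = vecs[:, np.argmax(vals.real)].real
    return float((d @ B @ d) / (d @ W @ d))
```

Well-separated, tight clusters give a large Q; overlapping clusters drive Q towards zero, matching the way Q is used to rank the decay schemes below.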
Values of Q calculated for each supplied dataset are shown in Table 4.
The results for Decay Schemes 1 and 2 suggest that these networks have high diagnostic ability. This means that the most appropriate way to decrease the neighbourhood value is exponentially. Also, although the learning constant does not seemingly affect the final output of the map, it is preferable to keep it at either a low value or to decrease it exponentially (see Table 2).
We decided to train the Summary and Headline Texts using the second Decay Scheme (exponential decrement for both neighbourhood value and learning rate), as it performs well for any number of cycles, rather than Decay Scheme 1, which needs a large number of cycles to work well.

3.2.4 Categorising texts by using Headlines, Summaries and Full Text
Results of categorising full-texts: The results of the Kohonen classifications for full texts are shown in Figure 4a. Using symbols to represent each of the locations of the 'winning node', the position of each text is indicated across the two-dimensional map (shown in Table 3). It can be seen that the quality of clustering for the full-texts is successful for a range of categories, but especially for categories 9 (Foreign Car Makers) and 10 (Worldwide Tax Sources). Patterns in categories 1 (Bioconversion), 4 (Fossil Fuels), 6 (Exportation of Industry) and 8 (International Drug Enforcement) are also effectively grouped together. The widespread distribution of Class 5 (Rain Forests) shows it to be the worst-clustered class on the map.
Figure 4a. Results of a Full Text Map trained using exponentially decreased neighbourhood and learning rate.
Results of categorising summaries: Topological ordering is also successful when training a Kohonen map on the texts following automatic summarisation, as illustrated in Figure 4b. The Discriminant value (Q) for the Summaries is 1.17, which is considered reasonably good in comparison with the Full Text Q value. The difference between the results for full-texts and summaries is small, with patterns in categories 8 and 10 experiencing a high degree of clustering for both. The degree of clustering for patterns in the other categories is broadly similar to that for full-texts, but varies for specific categories.
Figure 4b. Results of a Summary Text Map
Results of categorising headlines: The ability of the Kohonen map to cluster patterns representing headlines is significantly worse than for either the full texts or the summaries. The Discriminant value (Q) for the Headlines is 0.30, which is considered very low in comparison with the Full Text and Summary Q values. The sparse map shown in Figure 4c illustrates this poor clustering. Many of the patterns are represented on the same node, indicating that the self-organising feature-mapping algorithm was unable to distinguish between patterns from various categories, owing to the minimal number of words in each Headline Text.
Figure 4c. Results of a Headline Text Map
These results for the trained Kohonen map were similar across a number of trials, despite variations in the training method and learning rate used. Firstly, full texts and summaries categorised significantly better than headlines. Secondly, some categories (for example, 10) clustered consistently better than others (for instance, 5). By simply counting the number of feature-set words that appear in at least nine of the ten texts of each category, the best-clustered categories are guaranteed to have some of these words. This reflects the tendency of these categories to cluster well. On the other hand, for the worst-clustered category, at best only four of the texts share a common feature-set word. This difference in classification difficulty was also seen in the TIPSTER results [5] from two human assessors.
It seems the summary preserves enough information for successful categorisation by feature extraction. From this we conclude that the summaries contain sufficient parts of the main theme of the text, whereas the headlines do not. This is important as our feature extraction and summarisation methods extract features differently (weird words vs. lexical cohesion).
Returning to the issue of irrelevant texts within the categories used in the TIPSTER competition, we examined the assessors' marks and found that in most cases the assessors disagreed; there was a mixture of False Positives and True Negatives. There were only a few texts for which all marks were True Negative. This meant we could not use the TIPSTER irrelevancy scores to say a text should not cluster, since the human assessors themselves were confused. Further investigation of the 'irrelevant' texts revealed that the 'irrelevancy' of the articles was due to the fine-grained definition of the categories. In particular, in category 10 (Worldwide Tax Sources), which contained essentially all examples of unanimous True Negatives, all the 'irrelevant' articles were about US tax, whereas the 'relevant' articles were about tax outside the US. We were not surprised that our categorisation system, which relies on a small set of single-word terms, did not capture such a subtle difference in categorisation.
4 Conclusions and Future Work
This work illustrates how a simple method of document encoding, which involves little human intervention, is suitable for encoding different forms of text material despite differences in their length and the way in which the contents are presented. A self-organising neural network using this method of document encoding can successfully classify both full-text news documents and summaries of those documents. The technique involves the extraction of words that are both frequent and 'weird' from a corpus.
In order to improve the clustering of text documents, one might consider increasing the number of feature words, by decreasing the frequency and weirdness coefficient thresholds, although this increases the possibility of noise entering the inputs. Alternatively, we suggest replacing single-word terms by a combination of single and compound terms. We believe that this will extract terms that are not weird in themselves, but have a high weirdness coefficient when appearing as part of a domain-specific compound term. For example, the single words 'foreign' and 'trade' may well not be weird words individually, but if the phrase 'Foreign Trade' appears in a news document with a high frequency then it would be an important feature.
The use of summaries as document surrogates is of significant import for document categorisation in terms of computational efficiency, including processing time and storage. The work related to the WEBSOM architecture shows that short documents, like newsgroup communications, can be categorised fairly accurately using a two-level architecture. Our work supports the claim that document surrogates, like summaries of full documents, can be used in categorising full documents using self-organising feature maps, albeit, unlike WEBSOM, we make do with a single-level architecture.
References
1. Elman, J.L. Language Processing. In: Arbib, M.A. (ed.) The Handbook of Brain Theory and Neural Networks, 1995, pp. 508-513. The MIT Press.
2. Kohonen, T. Self-organizing maps. Berlin, Heidelberg and New York: Springer-Verlag, 1995. (The second edition, published in 1997, has a separate section on WEBSOM.)
3. Kohonen, T. Exploration of very large databases by self-organizing maps. In Proceedings of ICNN'97, 1997, pp. PL1-PL6. IEEE Service Center, Piscataway, NJ.
4. Kaski, S., Honkela, T., Lagus, K. and Kohonen, T. Creating an order in digital libraries with self-organising maps. In Proc. WCNN'96, World Congress on Neural Networks, 1996, pp. 814-817. Lawrence Erlbaum and INNS Press.
5. Mani, I. The TIPSTER SUMMAC Text Summarization Evaluation. Mitre Technical Report MTR 98W0000138, 1998.
6. Kivinen, J. and Warmuth, M.K. Exponentiated gradient versus gradient descent for linear predictors. Technical Report No. UCSC-CRL-94-16, 1994. Santa Cruz: Baskin Center for Computer Engineering and Information Sciences.
7. Lewis, D.D. Evaluating and optimising autonomous text classification systems. In SIGIR 95: Proc. of the 18th Annual ACM-SIGIR Conference on Research and Development in Information Retrieval, 1995, pp. 246-254.
8. Honkela, T., Kaski, S., Lagus, K. and Kohonen, T. Exploration of full-text databases. In Proc. of ICNN'96, International Conference on Neural Networks, 1996, 1, pp. 56-61. IEEE Service Center, Piscataway, NJ.
9. Lagus, K. Generalizability of the WEBSOM method to document collections of various types. In Proc. of the 6th European Congress on Intelligent Techniques & Soft Computing (EUFIT'98), 1998, 1, pp. 210-214. Aachen, Germany.
10. Salton, G. Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer. 1989. Reading, MA: Addison-Wesley.
11. Ahmad, K. Pragmatics of Specialist Terms and Terminology Management. In: Steffens, P. (ed.) Machine Translation and the Lexicon. 1995, pp. 51-76. Heidelberg: Springer.
12. Benbrahim, M. and Ahmad, K. Text summarisation: the role of lexical cohesion analysis. The New Review of Document and Text Management, 1995, 1. Taylor Graham Publishing.
13. Aston, G. and Burnard, L. The BNC Handbook: Exploring the British National Corpus with SARA. 1998. Edinburgh: Edinburgh University Press.
14. Chatfield, C. and Collins, A.J. Introduction to Multivariate Analysis. 1980. Cambridge University Press.
Figures
Figure 1. SOFM system developed at the University of Surrey. The numbers on the individual neurons in the output layer refer to the class of the input vector; the user can also use a colour scheme to distinguish between various classes.
Figure 2. Settings window of the SOFM system.
Figure 3a. Decay Scheme 1
Figure 3b. Decay Scheme 2
Figure 3c. Decay Scheme 3
Figure 3d. Decay Scheme 4
Figure 3e. Decay Scheme 5
Figure 3f. Decay Scheme 6
Figure 4a. Results of a Full Text Map trained using exponentially decreased neighbourhood and learning rate.
Figure 4b. Results of a Summary Text Map
Figure 4c. Results of a Headline Text Map
Tables

FULL TEXTS
percent tax billion drug reagan cars taxes environmental pollution fuel federal dukakis bush congress mexico emissions drugs fuels senate auto proposal gasoline exports vehicles ohio
greenhouse dioxide marine mazda gases shale deficit export recycling epa honda methanol automakers panama corp forests cocaine enforcement warming smog ozone massachusetts imports automobile trafficking

SUMMARY TEXTS
percent million tax billion drug taxes pollution cars administration reagan environmental dukakis fuel federal bush emissions congress drugs auto senate mexico greenhouse marine nations japan
shale dioxide gasoline fuels recycling controls proposal ohio deficit vehicles tuesday gases trafficking mazda exports export automakers epa honda sewage tropical ozone sulfur methanol cocaine

HEADLINE TEXTS
tax scientists reagan bill energy drug ap nuclear administration congress bush law bjt greenhouse sanctions taxes fuel cars alternative leader percent rain plants report dukakis
laserphoto incinerator panama sweeping drugs environmental climate wire votes washington parliament agreement treatment forest countries trade car group dea epa recyclable recycler hitches reportedly congressmen

Table 1: Feature words identified for three specific text types
Decay Scheme    Learning Rate    Neighbourhood Value
1               Fixed Low        Exp
2               Exp              Fixed
3               Exp              Linear
4               Linear           Fixed Low
5               Fixed High       Exp
6               Exp              Fixed High

Table 2: A scheme for varying learning rates and neighbourhood values for training SOFMs.
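The schemes in Table 2 combine fixed, linear and exponential profiles for the learning rate α(t) and the neighbourhood value σ(t). As an illustration only, the three profiles might be sketched as below; the exponential form uses the decay factor δ = −0.5 given in the Symbols list, while the starting values (0.05 and 5.0) are our own assumptions rather than values from the paper:

```python
import math

def fixed(value):
    """Fixed schedule: the parameter never changes during training."""
    return lambda t: value

def linear_decay(start, end, t_max):
    """Linear decay from start to end over t_max training cycles."""
    return lambda t: start + (end - start) * t / t_max

def exp_decay(start, delta=-0.5):
    """Exponential decay with decay factor delta (fixed at -0.5)."""
    return lambda t: start * math.exp(delta * t)

# Decay Scheme 1 of Table 2: a fixed low learning rate with an
# exponentially shrinking neighbourhood.
alpha = fixed(0.05)
sigma = exp_decay(5.0)

neighbourhoods = [sigma(t) for t in range(4)]  # strictly decreasing
```

The other schemes are obtained by swapping in a different schedule for either parameter, e.g. `linear_decay(1.0, 0.0, t_max)` for a linearly decreasing learning rate.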
TREC categories
1   Bioconversion               6   Exportation of Industry
2   Pollution Recovery          7   Foreign Trade
3   Alternative Fuels           8   Int. Drug Enforcement
4   Fossil Fuels                9   Foreign Car Makers
5   Rain Forests                10  Worldwide Tax Sources

Table 3: Text categories used in the TIPSTER – SUMMAC program [5]
Decay Scheme   1      2      3      4      5      6
Q-value        2.40   2.22   2.21   1.75   1.64   0.94

Table 4: Values of Q for each decay scheme
Symbols
W       weights
δ(t)    decay factor (usually given a fixed value of −0.5)
r       distance between the current node and the best-matching unit
α(t)    learning rate
σ(t)    neighbourhood value
N       total number of neurons in the output layer
t       current training cycle
Q       variation of Fisher's Linear Discriminant Rule
d       eigenvector corresponding to the largest eigenvalue of W⁻¹B
B       between-class covariance matrix
W       within-class covariance matrix
X̄i.     mean of class i
X̄..     overall mean
ni*     number of coordinates associated with class i that are not associated with any other class
Xij     vector of object i belonging to class j
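Given the definitions above, a discriminant score of this kind can be sketched with NumPy: build the between-class matrix B and within-class matrix W from per-class observation matrices, take the eigenvector of W⁻¹B with the largest eigenvalue, and evaluate the discriminant ratio along it. The exact definition of Q used in the paper may differ in detail; the ratio below is one standard reading of Fisher's rule:

```python
import numpy as np

def fisher_q(classes):
    """Fisher-style discriminant score for a list of per-class
    observation matrices (rows = input vectors, cols = coordinates)."""
    all_x = np.vstack(classes)
    grand_mean = all_x.mean(axis=0)          # overall mean
    dim = all_x.shape[1]
    B = np.zeros((dim, dim))                 # between-class covariance
    W = np.zeros((dim, dim))                 # within-class covariance
    for X in classes:
        mean_i = X.mean(axis=0)              # mean of class i
        diff = (mean_i - grand_mean).reshape(-1, 1)
        B += len(X) * diff @ diff.T
        centred = X - mean_i
        W += centred.T @ centred
    # d: eigenvector for the largest eigenvalue of W^-1 B
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(W) @ B)
    v = eigvecs[:, np.argmax(eigvals.real)].real
    return (v @ B @ v) / (v @ W @ v)         # discriminant ratio

# Well-separated classes should score higher than overlapping ones.
rng = np.random.default_rng(0)
far = [rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))]
near = [rng.normal(0, 1, (20, 2)), rng.normal(0.5, 1, (20, 2))]
```

A larger score indicates that class means are far apart relative to the spread within each class, which is how the decay schemes of Table 2 are ranked in Table 4.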
i. Using 5% for headlines gave a feature vector length that was almost half of that for full texts and summaries. As we expected headlines to give a poorer classification, we increased this to 10% to give approximately the same vector length in all three cases. We note that this is the only human intervention required in the method.