PUBLISHING HOUSE OF THE ROMANIAN ACADEMY
PROCEEDINGS OF THE ROMANIAN ACADEMY, Series A, Volume 1, Number 2/2000, pp. 117-127

AUTOMATIC CLASSIFICATION OF DOCUMENTS BY RANDOM SAMPLING

Dan TUFIŞ*, Camelia POPESCU**, Radu ROŞU***

* Romanian Academy RACAI, 13, "13 Septembrie", 74311 Bucharest, Romania, [email protected]
** TR&D, Av. Pierre Marzin, 22307 Lannion cedex, France, [email protected]
*** Dept. of Statistics, Univ. of North Carolina, Chapel Hill, USA, [email protected]

Abstract. This paper presents a thesaurus-based approach to document classification. We define a classification space based on the notion of theme vectors. For a new text, we compute its characteristic vector by considering only a sample of randomly extracted lemmas. We then compute the differences between this vector and the vectors in the document model, and the classification of the new text is decided based on the closest vector. We introduce a family of document classifiers depending on a parameter and present a statistical procedure to evaluate their effectiveness for different-sized corpora. We show that they have statistically distinct behaviour, so that it makes sense to look for an optimal value of the classifier parameter. We suggest that our method can also be used for comparing different ontologies with respect to their support in document classification, and that the same method can be used in assessing corpus homogeneity.

Key words: corpus linguistics; document classification; evaluation; lemmatization; natural language processing; POS tagging; thesaurus; vector spaces; word sense clustering.
1. INTRODUCTION
Automated text classification is a challenging problem of great practical importance. The challenge comes from the fact that none of the proposed solutions is perfect: if a classification engine is accurate, it is probably either domain/ontology specific, or it has heavy requirements with respect to language resources, or it is black-boxed for commercial reasons. In the vast majority of cases, a good text classification engine is domain specific, was built from large in-house language resources, and is a black box. The practical utility of automatic text classification is motivated by the tremendous and ever-growing amount of textual information on the web; searching for a needle in a haystack is becoming an obsolete comparison. With means to structure this "information jungle", one may turn the majority of hidden (and thus useless) documents into active sources of knowledge, entertainment or ideas, thus fulfilling the purposes for which they were published on the web.

Automatic text classification is also relevant for various "infrastructural" levels of natural language processing. Biber has shown [1] that the accuracy of a probabilistic tagger is related to the domain on which it was trained and on which it is used. By combining multiple domain-diversified language models/classifiers, Tufiş has demonstrated [12] a significant improvement of tagger accuracy. The same holds for word-sense disambiguation and parsing. Sekine discusses [11] various differences in the recall and precision of a probabilistic parser when the training and test data come from different domains. Similarly, Karlgren argues [4] that in information retrieval applications the effectiveness of a search (retrieval accuracy) is strongly determined by how well the style/register/domain of the training data used in building the search engine matches that of the documents to be retrieved. Kilgarriff discusses at length [5] the significance of comparing corpora and the merits and drawbacks of the methods proposed for text classification.

Learning to automatically classify textual documents is a typical machine learning task, defined as assigning (pre-defined) category labels to new documents based on the likelihood suggested by a model induced from a large training set of labeled documents [6], [7], [10], etc. Recently, a method based on EM has been proposed in [9] to boost the learning process by using both labeled and unlabeled documents. In [3], various supervised classification algorithms are discussed and analyzed in the context of a taxonomy of different types of
statistical questions that arise in machine learning (Figure 1).

Figure 1: A taxonomy of statistical questions in machine learning (a binary tree branching on single vs. multiple domains, analyzing classifiers vs. algorithms, predicting accuracy vs. choosing, and large vs. small samples, with the nine resulting questions numbered 1-9).
The taxonomy is developed along four binary criteria: a) whether the problem solving considers single or multiple domains, b) whether the analysis concerns learning algorithms or classifiers based on a given algorithm, c) whether the analysis aims at predicting accuracy or at making a choice, d) whether the training data is represented by large or small samples. The meaning and relevance of each case in Figure 1 is presented in [3]. We summarize the taxonomy criteria below:

a) the distinction between single or multiple application domains: in most applied research there is a single domain of interest, and one's goal is to find the best classifier or the best learning algorithm to apply in that domain;

b) the distinction between analyzing classifiers or algorithms: a classifier (aware of a K-class classification system) is a function that, given an input example, assigns that example to one of the K classes; a learning algorithm is a function that, given a set of examples and their classes, constructs a classifier. In a particular application setting, one's primary goal is usually to find the best classifier and estimate its accuracy on future examples;

c) the distinction between predicting the accuracy (of a classifier or a learning algorithm) and making a choice (of classifiers or algorithms): this distinction raises the issue of how accuracy can be measured; and of course, after evaluating the accuracy, one would like to choose the best classifier from some set of available classifiers;

d) the lowest level of the taxonomy concerns the amount of data available: if a large amount of data is available, one can set some of it aside to serve as a test set for evaluating classifiers. If the training data is limited (the most frequent situation), one needs to use all of it as input to the learning algorithms; this means that some form of resampling of the data (i.e., cross-validation or bootstrap) is needed in order to perform the statistical analysis.

In his paper, Dietterich [3] addresses question 8 (single-domains & analyze-algorithms & choose-algorithms & small sample; see Figure 1). This paper addresses questions 1 and 3.

Document/text type is sometimes referred to (quite informally) as document/text domain. One should be aware of the possible confusion between application domain and document/text type. Our experiments fit into the single-domain part of the taxonomy: the domain is document/text classification (into such classes as finance, tourism, fiction, etc.). However, as we will show, the approach is not restricted to this single application domain; it can also be used for other purposes, such as assessing the usefulness of competing ontologies or comparing corpora. We have used a supervised learning procedure to construct a family of classifiers depending on a parameter. Our primary goal is to find the best classifier and estimate its accuracy on future examples. Fortunately, we had at our disposal a large amount of data, separated into a training set, used to construct the classifiers, and test samples.

Having reviewed the general structure of the taxonomy, let us consider the questions concerning this paper (1 and 3).

Question 1: Suppose we are given a large sample of data and a classifier C trained to label an unseen text with one of the categories of interest. Can we estimate the accuracy of C and its error probability? Since we have large samples of training data, we can afford to set aside some of them for testing purposes, so the answer to this question is obviously positive and needs little elaboration.

Question 3: Given a family of n classifiers C1, C2, ..., Cn, can we estimate which classifier will be the most accurate on unseen texts? To answer this question we consider a family of classifiers depending on a single (discrete) parameter and show that the considered classifiers are statistically distinct (i.e., one can statistically differentiate them). Therefore, it makes sense to look for a value of the parameter that maximizes the accuracy of a classifier on the test set. This classifier will have the best expected accuracy on new data.
2. PROBLEM DEFINITION
The main linguistic resource for our classifier is the thesaurus of France Telecom. Its structure is based on the following idea: the meanings of a word/lemma are represented by concepts, and the concepts are clustered into themes. The themes are classes in the set of concepts and represent the axes of our vector space. Human experts decided on the distinctions between the meanings of words/lemmas as well as on their clustering into themes. The thesaurus has all its concepts grouped into more than 800 themes. From this thesaurus we extracted the corresponding themes for every <lemma, grammatical category> pair (the pairing was necessary because of possible lemma homography). In our experiments we considered (for the time being) only the nominal lemmas.

The classification algorithm generates for every text document a representation in the vector space. To construct the classifiers we took the following preliminary steps:
1) the corpus was lemmatized, so that each token was associated with its lemma and grammatical category;
2) all distinct (noun) lemmas Li (i=1,n) were extracted and associated with their occurrence frequencies Occi (i=1,n);
3) by looking up the thesaurus, each lemma-category pair was associated with the corresponding list of themes THik (k=1,ni).

The list THi = {THik | k=1,ni} represents all the distinct meanings of the lemma Li. The length (ni) of the list THi represents Li's ambiguity degree. As a short form for saying that a wordform w with lemma Li has, in a specific context, the meaning THik, we say that Li belongs to THik. For brevity we also say that a lemma Li belongs to several themes THi1, THi2, ..., THiq to mean that the list of meanings of Li is (THi1, THi2, ..., THiq).

In building the text classifiers, our algorithm relies on the following hypotheses:
(H1) the (relative) frequency of each theme is specific to a corpus; we call it the theme score;
(H2) for every lemma that belongs to several themes, the probability of belonging to a theme is proportional to the frequency of the theme in the corpus (the theme score).

Let Ci be a coefficient specific to Li, depending on the lemma's (relative) frequency and its degree of ambiguity: Ci = f(Occi, ni). From hypotheses (H1) and (H2) we derive the score of a theme THj (j=1,m) appearing in the meaning lists of various Li by the relation below:
$$S(TH_j) \;=\; \sum_{i} C_i\,\frac{S(TH_j)}{\sum_{k=1}^{n_i} S(TH_{ik})} \qquad (1)$$

where the sum runs over all lemmas Li whose meaning lists contain THj. The solution of this system can be approximated by an iterative method: at the n-th iteration we calculate the scores of THj by using the scores from the (n-1)-th step:

$$S_n(TH_j) \;=\; \sum_{i} C_i\,\frac{S_{n-1}(TH_j)}{\sum_{k=1}^{n_i} S_{n-1}(TH_{ik})} \qquad (2)$$

where $S_0(TH_j)=1$ (j=1,m). The fixed point of this system is an approximate solution of (1). The scores from the first iteration step,

$$S_1(TH_j) \;=\; \sum_{i} \frac{C_i}{n_i} \qquad (3)$$

represent the case when all the themes of an ambiguous lemma have the same probability. Using these scores is equivalent to replacing hypothesis (H2) with (H2*):

(H2*) for a lemma Li that belongs to several themes, the probability of belonging to a theme is proportional to 1/ni, where ni is the ambiguity degree of the lemma Li.

Depending on the way the scores were computed (i.e., the hypotheses we relied on), two types of classifiers were constructed:
- iterative classifiers, based on hypotheses H1 and H2, with scores computed by iterating the recursive formula (2);
- simple classifiers, based on hypotheses H1 and H2*, with scores computed by formula (3).
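To make the two score-computation regimes concrete, here is a minimal sketch in Python; it is our illustration, not the original implementation. The lemma records are invented placeholders, Ci is set to Occi, and a fixed number of iterations stands in for iterating to the fixed point.

```python
from collections import defaultdict

# Toy lemma records (lemma, occurrence count Occ_i, theme list TH_i);
# the entries are invented placeholders, not actual thesaurus data.
lemmas = [
    ("banque", 12, ["finance", "habitat"]),
    ("voyage", 7, ["voyages"]),
    ("produit", 9, ["commerce", "math", "finance"]),
]

def simple_scores(lemmas):
    """Formula (3): under (H2*) every theme of a lemma is equally likely,
    so each lemma contributes C_i / n_i to each of its themes (C_i = Occ_i)."""
    s = defaultdict(float)
    for _, occ, themes in lemmas:
        for th in themes:
            s[th] += occ / len(themes)
    return dict(s)

def iterative_scores(lemmas, steps=20):
    """Formulas (1)-(2): under (H2) a lemma's mass is split among its themes
    in proportion to the current theme scores, starting from S_0 = 1."""
    s = defaultdict(lambda: 1.0)          # S_0(TH_j) = 1 for every theme
    for _ in range(steps):                # iterate toward the fixed point
        s_new = defaultdict(float)
        for _, occ, themes in lemmas:
            total = sum(s[th] for th in themes)
            for th in themes:
                s_new[th] += occ * s[th] / total
        s = s_new
    return dict(s)
```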
The proper classifiers are constructed by using the vectorial method briefly sketched below.

Figure 2: A graphical representation of the domain modelling in a multidimensional space (cones for the Tourism1, Tourism2 and Finance domains, together with the theme vector of a new document).
For a set of pre-classified documents we compute the corresponding theme vectors. The
theme vectors of all the documents belonging to the same domain define a cone (in the 856-dimensional vector space of themes) that is taken to be the representative of the respective domain. Figure 2 suggests the n-dimensional domain-specific cones defined by the characteristic vectors of the training documents. For our experiments we defined three domains: Tourism1, Tourism2 and Finance. Each domain is modelled by an n-dimensional cone in the theme vector space. Each model was constructed by computing the characteristic vectors of 1000 documents (web pages) previously manually classified into the respective domain. A new document is represented in the same way, and its theme vector is checked against the cones of the predefined domains via a measure of closeness. The document is classified into the domain whose cone contains, or is closest to, the theme vector of the new document. In Figure 2, the new document is definitely a Tourism one and is more likely to belong to Tourism1 than to Tourism2.

Next we study the choice of the coefficient Ci. The first case is Ci = Occi, where the only hypotheses considered are (H1) and (H2). We experimented on two corpora composed of web pages from two different domains (finance and tourism1+tourism2). The lemma distribution of the finance corpus is focused on a few specific themes, whereas the area of specific themes of the tourism corpus is larger: 80% of the finance corpus is covered by 51 themes, whereas for tourism we need 96 themes. The first 15 themes are listed in order of their scores in Table 1. The theme "tlr" is associated with words that have a general meaning, not easily ascribable to a specific domain ("tlr" stands for "all the rest" / "tout le reste"): retour, vie, nature, exemple, reste. The following themes are specific to finance: droit, commerce, argent, finance, banque, comptabilite, entreprise. For tourism we have: habitat, deplacement, voie, voyages, commerce, urbanisme, chef-lieu, transport, ville. Frequent everyday words cover themes like "temporalite" (jour, année, temps, espace, heure, periode) or "agirFaire" (activite, action, plan, ...). Many ambiguous words have their meanings clustered in particular themes, which are artificially promoted in the ranking shown in Table 1. For instance, the "maths" theme is covered by lemmas like groupe, solution, indice, produit, catégorie, while "informatique" appears as one possible theme of lemmas such as page, information, liste, adresse.
Finance                      Tourisme
Theme          Score %       Theme          Score %
tlr            10.98207      tlr            10.14787
droit           7.77102      habitat         4.39096
commerce        5.77705      commerce        2.91312
argent          5.21507      temporalite     2.69326
agirfaire       4.06166      droit           2.51338
math            3.60900      deplacements    2.14071
temporalite     3.37726      voies           2.11299
finance         3.00270      voyage          2.01960
banque          2.78904      telecomm        1.83702
comptabilite    2.46459      pays            1.59403
informatique    2.42130      urbanisme       1.58601
entreprise      2.08937      chef-lieu       1.53578
rel-sociales    1.79555      agirFaire       1.24141
economie        1.62876      transport       1.17662
monnaies        1.32468      ville           1.17313

Table 1: The highest scored themes in the two corpora
To emphasize the specificity of a corpus, we propose a second version of the algorithm, which considers the ambiguity degrees of the lemmas contained in the corpus. We would like to give more credit, as domain clues, to those words that are mono-thematic or have fewer themes than common words. Therefore, we strengthen the scores of the themes associated with non-ambiguous words and decrease the scores of the themes of ambiguous ones; the penalty is proportional to the ambiguity degree of the corresponding lemma. This translates into hypothesis (H3):

(H3) the differentiating power of a word occurrence is inversely proportional to the ambiguity level of its associated lemma.

Finance                      Tourisme
Theme          Score %       Theme          Score %
entreprise      5.80855      telecomm        4.70470
comptabilite    4.37064      voyage          4.27568
gestion         4.19161      chef-lieu       4.06673
finance         4.10151      habitat         3.59820
commerce        3.96656      commerce        3.20029
droit           3.85233      loisir          2.38033
banque          3.06333      urbanisme       2.28100
argent          2.92832      ville           2.26503
tlr             2.88394      accueil         2.01336
monnaies        2.84109      calendrier      1.95958
economie        2.56046      jour            1.93612
temporalite     2.53729      regions         1.86988
telecomm        2.49117      tlr             1.83721
informatique    2.45593      restauration    1.65612
diriger         2.32274      capitales       1.33815

Table 2: The logarithmically corrected highest scored themes in the two corpora
We model this hypothesis by associating each lemma with a decreasing function of the ambiguity degree (f(ni), f : N*→R). We define f(ni) as 1/ln(k+ni), so the Ci coefficient becomes Ci = Occi/ln(k+ni), where Occi is the relative frequency of lemma Li, ni is the ambiguity degree of Li, and k is a parameter. The choice of this parameter was one of the goals of our experiments and is the subject of the next section. By computing the theme scores with the lemma coefficients Ci modified as described above, the ordering of the themes in the Finance and Tourism corpora changed significantly. The new rankings (for k=0.1) are shown in Table 2, and they obviously give better support to the empirical linguistic motivation according to which domain-specific words should have a higher weight in the classification of a new document. This time the specific themes were filtered and promoted in the classification. In the finance corpus we have high scores for: entreprise, comptabilite, gestion, finance, commerce, droit, banque, argent, and for tourism the high-scoring themes are: telecommunication, voyage, chef-lieu, habitat, commerce, loisir, urbanisme, ville, accueil. The words "telephone" and "fax", which have a high frequency of occurrence in the tourism domain, favour the theme "telecommunication". The diagram shown in Figure 3 visualizes the effect of the logarithmic factor on the theme scores (in the finance corpus).
Figure 3: Theme scores in the finance corpus; positive values show the scores before the logarithmic correction, negative values show the scores after the correction.
For a more intuitive perception of this effect, we represented on the same diagram the theme scores before the logarithmic correction (the positive Ys) and after the proposed correction (the negative Ys). The horizontal axis represents the themes, identified by their position in the alphabetically ordered list of all themes. One can notice in the upper part of the diagram in Figure 3 that the theme "tlr" had the highest score (0.1) before the logarithmic correction. After the logarithmic correction, the score of the "tlr" theme significantly decreased (to 0.02), as shown in the lower part of Figure 3.
The four diagrams in Figure 4 show the effect of various values of the k parameter (0.1, 0.5, 1 and 11) on how different lemmas contribute to the score of a theme. When k is small (k=0.1), the occurrences of non-ambiguous lemmas contribute strongly to the score of a specific theme. For larger values of k, an occurrence of an unambiguous lemma is almost as influential in the score of its theme TH as the occurrence of a 20-way ambiguous lemma that happens to have TH as one of its possible meanings. Apparently, k should be kept as low as possible, leaving the entire discriminative burden on the mono-thematic lemmas. However, as we will show in the next section, this is not the
case. One explanation for this is that the lemmas which are mono-thematic are rare. Given that our classifiers make their decisions based on random
samples extracted from the documents to be classified, such lemmas might not appear at all in the samples.
Figure 4: The effect of various values of the k parameter (0.1, 0.5, 1, 11) on how different lemmas contribute to the score of a theme.
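To get a numerical feel for the curves in Figure 4, one can tabulate the per-occurrence weight 1/ln(k+n) for a few ambiguity degrees n. The sketch below is our illustration, not part of the original experiments; the chosen n values are arbitrary.

```python
import math

# Per-occurrence contribution of a lemma with ambiguity degree n to one of
# its themes is proportional to 1 / ln(k + n)   (C_i = Occ_i / ln(k + n_i)).
for k in (0.1, 0.5, 1.0, 11.0):
    weights = {n: round(1.0 / math.log(k + n), 3) for n in (1, 2, 5, 20)}
    print(f"k = {k:>4}: {weights}")
```

With k=0.1 a mono-thematic lemma weighs roughly 30 times more per occurrence than a 20-way ambiguous one, while with k=11 the ratio drops to about 1.4, which matches the behaviour visible in Figure 4.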
3. EXPERIMENTS AND RESULTS
A document classification algorithm must be not only as accurate as possible, but also very fast. Speed is of utmost importance when the classifier is supposed to work with a vast number of documents (as a web-document classifier is). This is why statistical algorithms, which usually run in linear time, have received so much attention and praise. However, all the accurate document classification methods we are aware of assume processing the entire document under consideration. We argue that by random sampling of a document one can achieve very good classification accuracy, but in much less time. The speed gain (for a linear classification algorithm) is approximately the ratio between the size of the document and the size of the sample¹.

¹ Since only nouns are considered in our method (as in many other classification algorithms), the two sizes refer to the number of nouns appearing in the document and the number of nouns randomly selected in the sample.

The way we defined the scores of the themes appearing in a document to be classified is sensitive, as shown previously, to the value of the parameter k. We show that by using different values for this parameter one can obtain a set of classifiers with statistically distinct behaviour. A natural question is which would be the best classifier for a given task, that is, what value of the parameter k allows the classification task to be performed with as few errors as possible.

The experimental setting was as follows: having the training data pre-classified into two domains D1 and D2, and a test data set known to be set aside from D2, we used various sample sizes, various values of the k parameter, and the two methods for theme score calculation (simple and iterative) described in section 2. In a first round of experiments we set D1 to "Finance" and D2 to "Tourism". For the second round of experiments we set D1 to "Tourism1" and D2 to "Tourism2", thus aiming to solve a more difficult problem, namely to classify documents into two related classes. Due to space limitations we will address only the second kind of experiments. We should, however, mention that the accuracy of document classification between "Finance" and "Tourism" (the experiments not discussed here) was 100%, showing that when the domains are very different, the automatic classification can be done very reliably.

As we said before, the classification of a document is decided based on the closeness of its theme vector to the domain-specific theme vectors (actually to the vectorial cones). To experiment with various classifiers and measure their performance as a function of the different
parameters (k, sample size, method for computing the theme scores), the training process was repeated for each instantiation of a classifier. A classifier Ck calculates the theme vectors for the training texts (TH_D1 and TH_D2). Then a sample P of (noun) lemmas is drawn at random, with uniform probability, from a document to be classified (belonging to the test data set). The classifier Ck computes the theme vector of P (TH_P). The document from which P was extracted is labelled with the type of the training corpus whose theme vector has the highest correlation with TH_P. For the size of P we experimented with different values: 50, 100, 150, 200, 250, 300. We considered a normal distribution for the difference between the correlation measures:

corr1 − corr2 = corr(TH_D1, TH_P) − corr(TH_D2, TH_P)
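A minimal sketch of this decision step is given below; theme_vector stands for the score computation of section 2, and Pearson correlation is used as the closeness measure. The function and variable names are ours, for illustration only.

```python
import random
import numpy as np

def classify(doc_lemmas, th_d1, th_d2, theme_vector, sample_size=100):
    """Label a document by correlating the theme vector of a random lemma
    sample P against the domain theme vectors TH_D1 and TH_D2."""
    p = random.sample(doc_lemmas, min(sample_size, len(doc_lemmas)))
    th_p = theme_vector(p)                    # theme vector of the sample P
    corr1 = np.corrcoef(th_d1, th_p)[0, 1]    # corr(TH_D1, TH_P)
    corr2 = np.corrcoef(th_d2, th_p)[0, 1]    # corr(TH_D2, TH_P)
    return "D1" if corr1 > corr2 else "D2"
```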
We tested two variants of classifiers. The first one follows from (H1) and (H2); it is based on the iterative algorithm and uses the theme vector established after the stabilization of the scores. The second one derives from (H1) and (H2*) and uses the theme vector calculated at the first iteration step; we call it the simple procedure. The diagrams in Figure 5 and Figure 6 show the QQ-plots of a random sample from N(0,1) and of a P sample.
Figure 5: 1000 random numbers from N(0,1)
As one can see, the similarity of the two plots strongly supports the normal distribution assumption. Therefore, knowing that P is from D2, we estimate the error rate as the probability that the standard normal distribution takes values less than −mean(corr1−corr2)/std(corr1−corr2).
Figure 6: The 1000 simulation data (k=11, 50/T1F/simple)
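Under the normality assumption just described, the error estimate reduces to one line of code. The sketch below is our rendering of the paper's formula (diffs stands for the corr1 − corr2 values observed on the test documents, which are known to come from D2):

```python
import numpy as np
from scipy.stats import norm

def error_rate_estimate(diffs):
    """Mirror the paper's estimate: Phi(-mean(diffs) / std(diffs)),
    where diffs are the observed corr1 - corr2 differences."""
    diffs = np.asarray(diffs, dtype=float)
    return norm.cdf(-diffs.mean() / diffs.std(ddof=1))
```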
The training data we had at our disposal came from two domains: 1000 web pages on finance and 2000 web pages on tourism. We built three corpora, each containing 1000 web pages: one on finance (F) and two on tourism (T1 and T2). In the first round of experiments the training corpora were F and T1, with the documents in T2 used for testing. As we said before, this yielded 100% accurate classification. The second round of experiments used T1 and T2 as training data and documents from T2 for testing purposes. This time the choice was between two types within the same domain. As shown further, our classifiers were able to reliably differentiate between the two sub-corpora of T, as they place different emphasis on various aspects (gastronomy versus entertainment, for instance). For each experiment k took values in the set {0.1, 0.2, ..., 0.9, 1, 2, ..., 11, 12, ..., 20} and the results were calculated for test samples of different sizes.

As we said before, the families of classifiers defined by the different underlying hypotheses and by the various values of the parameter k were evaluated in order to test whether the members of the family were distinct classifiers. To this end we used McNemar's test, briefly presented in the following. Let f1 and f2 be two classifiers and let T be the document test set containing n documents. For each x∈T, we record how it was classified and construct the contingency table:

n00 = number of examples misclassified by both f1 and f2
n10 = number of examples misclassified by f2 but not by f1
n01 = number of examples misclassified by f1 but not by f2
n11 = number of examples misclassified by neither f1 nor f2
Let |T| = n = n00 + n01 + n10 + n11. Under the null hypothesis (that is, both classifiers work with the same accuracy), n01 = n10, with an expected count of (n01 + n10)/2 for each. The following statistic is distributed (approximately) as χ² with 1 degree of freedom (the −1 term in the numerator is a "continuity correction", accounting for the fact that the statistic is discrete while the χ² distribution, on which McNemar's test is based, is continuous; see [3] for details on deriving this distribution):

$$\chi^2 = \frac{(|n_{01} - n_{10}| - 1)^2}{n_{01} + n_{10}}, \qquad \chi^2_{1,0.95} = 3.84146$$
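A direct transcription of the statistic, as a minimal sketch (the example counts are taken from the k=0.1 row of Table 3):

```python
def mcnemar_statistic(n01, n10):
    """McNemar's chi-square statistic with continuity correction (1 d.f.)."""
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)

chi2 = mcnemar_statistic(n01=12, n10=2)   # 5.7857...
distinct = chi2 > 3.84146                 # True: reject the null hypothesis
```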
If the null hypothesis (the compared classifiers are statistically indiscernible) is correct at a 95% confidence level (the probability of wrongly rejecting the null hypothesis is less than 0.05), then this quantity is less than 3.84146.

Let us turn now to the experimental results. For identification purposes we use the labelling size/T1-T2/method, with size referring to the number of randomly extracted lemmas (50, 100, 150, 200, 250, 300), T1-T2 showing that the classification process is meant to label a new document as either T1 or T2, and method being one of simple or iterative (as described before). For instance, 100/T1-T2/simple identifies a set of experiments (further differentiated by the value of the K parameter) where the classifiers trained on T1 and T2, based on 100 randomly extracted lemmas from each document and using the simple method, classified the 1000 documents contained in T2.

We used McNemar's test to estimate whether two classifiers behave statistically distinctly or not. In the tables below, N0x represents the total number of errors made by the first classifier, Nx0 the total number of errors made by the second classifier, N01 the number of errors made only by the first classifier, and N10 the errors made only by the second classifier. Table 3 presents the results of the 100/T1-T2/simple versus 100/T1-T2/iterative classifiers. Each row in the table corresponds to a different value of the K parameter². The rows whose McNemar value exceeds 3.84146 correspond to classifiers with statistically distinct behaviour (at the 95% confidence level).
² The upper limit (11) for k in our experiments is not arbitrary. We built classifiers with the k parameter set to values higher than 11 and the results were statistically similar to those obtained without using the logarithmic factor (log-symmetry around 1).
As Table 3 shows, the best performance of the classifiers working with samples of 100 lemmas corresponds to k=10 and the simple method. For k=10, the 100/T1-T2/simple classifier correctly classifies 906 documents out of 1000, making only 94 errors. The best performance of the iterative classifiers is achieved by the one with k=0.8. Table 3 also shows that the classifiers defined by the two methods of theme score calculation are, in general, statistically different classifiers.

K     N0x   Nx0   N10   N01   McNemar's test value
0.1   159   169     2    12    5.785714
0.2   158   168     7    17    3.375
0.3   157   158    23    24    0
0.4   150   159    26    35    1.04918
0.5   141   155    24    38    2.725806
0.6   136   153    26    43    3.710145
0.7   134   148    30    44    2.283784
0.8   124   148    22    47    8.347826
0.9   123   149    26    52    8.012821
1.0   124   149    27    52    7.291139
2.0   115   157    27    69    17.51042
3.0   107   162    26    81    27.25234
4.0   101   163    24    86    33.82727
5.0    99   164    26    91    35.00855
6.0    96   161    26    91    35.00855
7.0    96   163    27    94    36.00
8.0    95   167    27    99    40.00794
9.0    95   172    27   104    44.0916
10     94   171    28   105    43.42857
11     94   169    29   104    41.17293

Table 3: Comparing the 100/T1-T2/simple and 100/T1-T2/iterative classifiers for various K values
Table 4 displays the same results when using a sample of 300 lemmas. The accuracy of the classification process increased significantly for all classifiers. For values of k higher than 2.0, the 300/T1-T2/simple and 300/T1-T2/iterative classifiers display statistically distinct behaviour (with 95% certainty). Again, the simple method shows results superior to the iterative one: for k=10 the 300/T1-T2/simple classifier makes only 6 errors in the classification of the 1000 test documents.
K     N0x   Nx0   N10   N01   McNemar's test value
0.1    35    35     2     2    0.25
0.2    34    31     6     3    0.444444
0.3    33    29     8     4    0.75
0.4    29    26     7     4    0.363636
0.5    25    25     6     6    0.083333
0.6    24    27     5     8    0.307692
0.7    24    25     8     9    0
0.8    24    26     8    10    0.055556
0.9    24    26     8    10    0.055556
1.0    23    26     7    10    0.235294
2.0    17    28     7    18    4
3.0    13    28     4    19    8.521739
4.0    11    30     3    22    12.96
5.0    10    31     3    24    14.81481
6.0     8    30     2    24    16.96154
7.0     8    31     2    25    17.92593
8.0     8    29     3    24    14.81481
9.0     7    29     2    24    16.96154
10      6    28     2    24    16.96154
11      6    28     2    24    16.96154

Table 4: Comparing the 300/T1-T2/simple and 300/T1-T2/iterative classifiers for various K values

Once we showed that the method of theme score calculation defines statistically distinct classifiers, we checked that, for a given method, considering only the value of k also yields statistically distinct classifiers. Table 5 shows this for two 100/T1-T2/simple classifiers and Table 6 for two 300/T1-T2/simple classifiers. In both tables, the comparison is made between the classifiers corresponding to the k values 0.1 and 11.

K        N0x   Nx0   N10   N01   McNemar's test value
0.1/11   159    94    84    19   39.76699

Table 5: Two 100/T1-T2/simple classifiers for K: 0.1 and 11

K        N0x   Nx0   N10   N01   McNemar's test value
0.1/11    35     6    32     3   22.4

Table 6: Two 300/T1-T2/simple classifiers for K: 0.1 and 11

The diagram in Figure 7 shows the decrease of the error rate probability, and of the width of the dispersion intervals for the error probability, as the lemma sample size increases.

Figure 7: The error probability (with 95% confidence) for the size/T1-T2/simple classifiers with k=11
4. CONCLUSIONS AND FURTHER WORK
We proposed a document classification method based on random selection of a small set of words (nouns). The basic idea is to exploit, by statistically sound methods, the non-independent distribution of word meanings in thematically coherent texts. The selected wordforms are lemmatized and further used to compute a vector in an N-dimensional space (N being the number of themes defined by the thesaurus). From the training data (documents already classified) we compute reference vectors against which the theme vectors of new texts are evaluated.

We have shown that by using a parameter to smooth the significance of the lemma frequency, taking into account its polysemy degree, one defines a family of distinct classifiers, out of which, empirically, one may be chosen. We have also shown that the classifiers built by the simple method perform better than those resulting from the iterative construction method. One possible explanation might be that the thesaurus we used contains noisy distinctions (i.e., word senses which are very difficult to differentiate by distributional analysis) and that the iterative method tends to amplify this noise (strengthening the discriminative power of some distinctions which cannot be reliably made). For the simple algorithm a high value of k is better, and irrespective of k, using a larger sample of lemmas (300 instead of 100) significantly decreases the number of classification errors.

The method is sensitive to the content and structuring of the underlying thesaurus. That is, given the same training data and the same test data, the reference vectors and the classification results may change when different thesauri are used. On the other hand, with a given thesaurus and test data set, for various training data sets, the accuracy in classification of the test data will be sensitive to the homogeneity of the training data sets. This suggests other possible uses of the method presented here and further investigation:

- assessing the computational effectiveness of the ontological organization of a lexical knowledge base: if the concepts are differentiated by the ontology in a computationally meaningful way, it is reasonable to expect good classification results. The assumption here is that if a classification process is more accurate when using a thesaurus TH1 instead of TH2, then the first is better structured (at least with respect to a distributional kind of analysis);
- comparing various special corpora (as used in training document classifiers) with respect to their homogeneity: the less homogeneous the training corpora, the higher the classification errors of the induced classifiers. So, by observing the performance of similar classifiers trained on different corpora of the same kind, one could estimate the homogeneity of the two corpora.

The results show that the choice of the parameter depends on the domains we must distinguish between: for closer or similar domains we do not need to accentuate the specificity by using a low parameter, but for quite different domains a small parameter is better. The probability of an incorrect choice decreases for a small-size sample, but the choice of the optimal parameter does not change.

REFERENCES

1. BIBER, D., Using Register-Diversified Corpora for General Language Studies, Computational Linguistics, Vol. 19, no. 2, pp. 219-241, 1993.
2. COHEN, W. W., SINGER, Y., Context Sensitive Learning Methods for Text Categorization, ACM Trans. Inf. Syst., Vol. 17, no. 2, pp. 141-173, 1999.
3. DIETTERICH, T. G., Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms, 1998, http://www.cs.orst.edu/~tgd/cv/pubs.html.
4. KARLGREN, J., Stylistic Experiments in Information Retrieval, in T. Strzalkowski (ed.), Natural Language Information Retrieval, pp. 147-166, Kluwer, 1999.
5. KILGARRIFF, A., Comparing Corpora, International Journal of Corpus Linguistics, John Benjamins Publishing Company (to appear).
6. LEWIS, D. D., GALE, W. A., A Sequential Algorithm for Training Text Classifiers, in Proceedings of SIGIR'94, pp. 3-12, 1994.
7. LEWIS, D. D., KNOWLES, K. A., Threading electronic mail: A preliminary study, Information Processing and Management, Vol. 33, no. 2, pp. 209-217, 1997.
8. LI, H., YAMANISHI, K., Document Classification Using a Finite Mixture Model, in Proceedings of ACL 1998.
9. NIGAM, K., MCCALLUM, A. K., MITCHELL, T., Text Classification from Labeled and Unlabeled Documents Using EM, Machine Learning, Kluwer Academic Publishers, pp. 1-34, 1999.
10. PAZZANI, M. J., MURAMATSU, J., BILLSUS, D., Syskill & Webert: Identifying interesting Web sites, in Proceedings of the 13th National Conference on Artificial Intelligence (AAAI-96), pp. 54-59, 1996.
11. SEKINE, S., The domain dependence of parsing, in Proceedings of the 5th Conference on Applied NLP, pp. 96-102, 1998.
12. TUFIŞ, D., Using a Large Set of Eagles-compliant Morpho-syntactic Descriptors as a Tagset for Probabilistic Tagging, in Proceedings of the Second International Conference on Language Resources and Evaluation, pp. 1105-1112, 2000.