Reliability in ICA-based Text Classification

Xavier Sevillano, Francesc Alías, and Joan Claudi Socoró
Communications and Signal Theory Department
Enginyeria i Arquitectura La Salle, Universitat Ramon Llull
Pg. Bonanova 8, 08022 - Barcelona, Spain
{xavis, falias, jclaudi}@salleURL.edu

Abstract. This paper introduces a novel approach for improving the reliability of ICA-based text classifiers, attempting to make the most of the independent component analysis data. In this framework, two issues are addressed: firstly, a relative relevance measure for category assignment is presented. Secondly, a reliability control process is included in the classifier, preventing the classification of documents that belong to none of the categories defined during the training stage. The experiments have been conducted on a journalistic-style text corpus in Catalan, achieving encouraging results in terms of rejection accuracy. However, the proposed relevance measure yields results similar to those of the classic magnitude-based technique for category assignment.

1 Introduction

In the last decade, text classification (TC) has become a focus of interest for the information retrieval research community, due to the need to organize the rapidly growing amount of digital documents available worldwide. The main approaches to TC belong to the machine learning paradigm (see [1] for an extensive review). Generally, the TC process is divided into two phases: training and test. Training consists of building a classifier from a collection of documents (the training set). Testing is the process of classifying a new group of documents (the test set) according to the structure learned during the training phase.

In the context of TC, the application of Independent Component Analysis (ICA) is based on the assumption that a document collection (or corpus) is generated by a combination of several thematic topics [2, 3, 4]. Thus, the independent components (IC) obtained by ICA algorithms define statistically independent clusters of documents, allowing their thematic classification. ICA has mainly been used for text retrieval [2, 5], text classification [3] and text clustering [4, 6]. Another interesting application of ICA is chat topic spotting [7, 8], which takes into account the temporal dimension of text data. ICA has also been employed as a feature extraction technique to improve the performance of Naïve Bayes [9] and Support Vector Machines [10] classifiers. Moreover, in the context of multi-domain text-to-speech synthesis [11], an ICA-based hierarchical text classifier was presented in [12].

This paper presents a first step towards making the most of the information provided by ICA-based text classifiers in order to increase their reliability. In this sense, we introduce i) a relative relevance measure for category assignment, and ii) a novel reliability evaluation process that prevents the ICA-based text classifier from categorizing documents belonging to none of the categories defined during the training stage. This paper is structured as follows: section 2 reviews the fundamentals of the application of ICA to text classification. In section 3, the proposals for improving the reliability of ICA-based text classification are presented. Section 4 describes the conducted experiments and, finally, the conclusions of our work are discussed in section 5.

2 Fundamentals of ICA for Text Classification

Text classification is defined as the task of assigning a collection of documents D = {d_1, d_2, ..., d_|D|} to a set of predefined categories C = {c_1, c_2, ..., c_|C|} [1]. Many TC methods are based on the vector space model (VSM) representation [1]. As a result, each document d_j is defined as a vector of weights w_j related to the terms composing the text. Thus, a document corpus containing |D| documents and |T| terms is represented by means of a term-by-document matrix X, whose rows are indexed by the terms t_1, ..., t_|T| and whose columns are indexed by the documents d_1, ..., d_|D|:

    X = \begin{pmatrix}
        w_{11}   & w_{12} & w_{13} & \cdots & w_{1|D|} \\
        w_{21}   & w_{22} &        &        & \vdots   \\
        \vdots   &        & \ddots &        & \vdots   \\
        w_{|T|1} & \cdots & \cdots & \cdots & w_{|T||D|}
        \end{pmatrix}    (1)
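As a minimal illustration of this representation, the matrix of equation (1) can be built from raw term frequencies. The three-document toy corpus below is a hypothetical stand-in, not the corpus used in the experiments of section 4:

```python
import numpy as np

# Toy three-document corpus over a tiny vocabulary (hypothetical data,
# not the AVUI corpus used in the experiments).
docs = [
    "election parliament vote",
    "concert guitar vote",
    "guitar concert stage",
]
terms = sorted({t for d in docs for t in d.split()})

# Term-by-document matrix X (|T| x |D|), as in equation (1):
# X[i, j] holds the frequency of term i in document j.
X = np.zeros((len(terms), len(docs)))
for j, d in enumerate(docs):
    for t in d.split():
        X[terms.index(t), j] += 1

print(X.shape)  # (6, 3): 6 distinct terms, 3 documents
```

In practice the raw frequencies are replaced by weights such as the normalized tf×idf used in section 4.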

In order to reduce the dimensionality of X and avoid overfitting, term space reduction techniques are applied: i) topic-neutral words such as prepositions, conjunctions, etc. are removed (stoplisting), ii) different morphological inflections of a word are merged into a single form (stemming) and iii) terms that occur in few documents are eliminated (document frequency thresholding) [1]. Using ICA for TC is related to the application of Latent Semantic Analysis (LSA) [13]. This term extraction technique projects the data onto an orthogonal document space¹ of reduced dimensionality by means of Singular Value Decomposition (SVD), extracting the K principal components from the data. This procedure is equivalent to the whitening and dimension reduction preprocessing steps that are usually applied to simplify the ICA problem [14].

¹ Data can also be projected onto a term space by using LSA, which allows keyword identification (see [3, 4]), but this is beyond the scope of this paper.
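The SVD-based projection can be sketched as follows. The matrix below is random stand-in data (dimensions chosen arbitrarily), and the whitening step is the generic rescaling used before ICA rather than the paper's exact preprocessing:

```python
import numpy as np

rng = np.random.default_rng(0)
# Random stand-in for the weighted term-by-document matrix X of
# equation (1): 20 terms, 12 documents (hypothetical data).
X = rng.random((20, 12))

K = 4  # number of principal components to retain
# LSA: project the documents onto a K-dimensional space via SVD.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_lsa = np.diag(s[:K]) @ Vt[:K, :]   # K x |D| reduced representation

# Whitening for ICA additionally rescales the retained components so
# that (up to centring) they have unit variance.
Z = Vt[:K, :] * np.sqrt(X.shape[1])

print(X_lsa.shape, Z.shape)  # (4, 12) (4, 12)
```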

Applying an ICA algorithm to the LSA data yields a matrix S containing K independent components (IC):

    S = (s_1 s_2 ... s_K)^T = \begin{pmatrix}
        s_{11} & s_{12} & \cdots & s_{1|D|} \\
        s_{21} & s_{22} & \cdots & s_{2|D|} \\
        \vdots & \vdots & \ddots & \vdots   \\
        s_{K1} & s_{K2} & \cdots & s_{K|D|}
        \end{pmatrix}    (2)

Each independent component (s_k for k = {1, ..., K}) defines a cluster of documents with a common thematic category. Consequently, documents are classified under K categories. The ICA-based text classifier comprises the separating matrix W and the IC of the documents of the training set, obtained by the ICA algorithm. Hence, new unseen documents (queries, q_j) composing the test set are classified by comparing their IC with the training IC, after projecting them onto the ICA space by means of matrix W. Moreover, a hierarchical organization of documents can be derived by performing a sweep of values of K (the space dimensionality) in the neighbourhood of the number of thematic categories contained in the corpus (|C|) [12].
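This pipeline can be sketched on synthetic data with a minimal symmetric FastICA iteration using the nonlinearity g(u) = u², which seeks maximally skewed components (the FastICA variant referenced in section 4). The data, dimensions and iteration count are illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for the LSA-reduced corpus: three skewed sources
# ("topics") mixed linearly; purely illustrative data.
S_true = rng.exponential(size=(3, 1000))
A = rng.standard_normal((3, 3))
X = A @ S_true

# Whitening: centre the data and rescale its principal components.
Xc = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(np.cov(Xc))
Z = np.diag(d**-0.5) @ E.T @ Xc

# Symmetric FastICA with g(u) = u^2 (skewness-seeking); minimal sketch.
K = 3
W = rng.standard_normal((K, K))
for _ in range(100):
    U = W @ Z
    W = (U**2) @ Z.T / Z.shape[1]        # w_k <- E[z (w_k^T z)^2]
    dw, Ew = np.linalg.eigh(W @ W.T)     # decorrelation: (W W^T)^(-1/2) W
    W = Ew @ np.diag(dw**-0.5) @ Ew.T @ W

S = W @ Z   # matrix of independent components, as in equation (2)
# New queries would go through the same whitening and projection by W,
# and their ICs would be compared with the training ICs.
```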

3 Towards a Reliable ICA-based Text Classifier

The performance of a text classifier is challenged when classifying unseen documents (i.e. during the test stage). Its reliability can be improved if i) the misclassification of queries that should be categorized under one of the categories contained in the training set is minimized, and ii) the rejection of queries belonging to none of the training categories (out-of-domain, OOD) is maximized. The following sections describe our proposals for addressing both issues.

3.1 A Relative Measure of Relevance

Selecting the most suitable category for a given document depends on the degree of belonging (relevance) of that document to such a category. In the classic TC literature [1], the concept of relevance is defined by means of a function CSV_k(d_j) (categorization status value) that measures the appropriateness of classifying document d_j under category c_k:

- in hard (fully automated) text classification, document d_j is assigned to the category that attains the maximum CSV_k(d_j);
- in soft (semi-automated) text categorization, a ranking of the categories, according to their CSV_k, is presented to the user.

In ICA-based TC, the categorization status value has commonly been defined as the magnitude of the independent components of each document [6] (see equation 3), often normalized by a softmax transformation [3, 4]:

    CSV_k(d_j) = s_{kj}    (3)

Nevertheless, Kabán and Girolami [4] suggest avoiding direct comparison of IC values, due to their possibly different scaling, which can produce misclassifications. Taking this idea into account, we propose a relative CSV_k(d_j), defined in equation 4:

    CSV_k(d_j) = F_k(s_{kj})    (4)

where F_k denotes the cumulative distribution function (cdf) of the k-th independent component. This method is insensitive to the distribution of the IC values, as it measures the relative relevance of document d_j to category c_k.
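A minimal sketch of this relative measure, using the empirical cdf of the training IC values. The two-component data below are hypothetical and deliberately placed on very different scales:

```python
import numpy as np

def relative_csv(S_train, s_query):
    """CSV_k(q) = F_k(s_kq), where F_k is the empirical cdf of the
    k-th training independent component (equation 4, sketched)."""
    K = S_train.shape[0]
    return np.array([np.mean(S_train[k] <= s_query[k]) for k in range(K)])

# Two ICs on very different scales: comparing raw magnitudes (equation 3)
# would always favour component 0; the cdf makes them comparable.
S_train = np.array([[100.0, 200.0, 300.0, 400.0],
                    [0.1,   0.2,   0.3,   0.4]])
s_query = np.array([150.0, 0.35])
csv = relative_csv(S_train, s_query)
print(csv)  # [0.25 0.75] -> category 1 is the most relevant
```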

3.2 Evaluating the Reliability of the Classification Decisions

A reliable classifier should be able to measure the appropriateness of its decisions, notifying the user about the confidence level of each classification (in an interactive TC system) or rejecting the query (in a fully automatic TC framework). In this section, we propose a method for evaluating the reliability of the classification decisions, based on the definition of a relevance index vector containing the CSV_k values corresponding to document d_j (equation 5):

    RI(d_j) = [CSV_1(d_j)  CSV_2(d_j)  ...  CSV_K(d_j)]^T    (5)

In order to reject OOD queries, it is necessary to define a confidence region for each category during the training phase. The confidence region of category c_k (CR_k) is defined in terms of two parameters:

- the mean value of RI(d_j), µ(CR_k);
- the standard deviation of RI(d_j), σ(CR_k);

both computed over the set of training documents correctly classified under the k-th category. During the test phase, each query q_j is classified under the k-th category only if RI(q_j) is fully contained within the confidence region CR_k, i.e. a conditional classification is performed.
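The confidence-region test can be sketched as follows. The relevance-index vectors below are hypothetical, and the ±2σ bounds follow the setting found optimal in section 4.2:

```python
import numpy as np

def confidence_region(RI_train):
    """mu(CR_k) and sigma(CR_k): mean and standard deviation of the
    relevance-index vectors of the training documents correctly
    classified under category k."""
    return RI_train.mean(axis=0), RI_train.std(axis=0)

def accept(ri_query, mu, sigma, n_sigma=2.0):
    """Classify the query under the category only if RI(q) lies fully
    inside mu +/- n_sigma * sigma (conditional classification)."""
    lo, hi = mu - n_sigma * sigma, mu + n_sigma * sigma
    return bool(np.all((ri_query >= lo) & (ri_query <= hi)))

# Hypothetical relevance-index vectors (K = 4) of training documents
# correctly classified under one category.
RI_train = np.array([[0.90, 0.20, 0.10, 0.10],
                     [0.80, 0.30, 0.20, 0.20],
                     [0.85, 0.25, 0.15, 0.10]])
mu, sigma = confidence_region(RI_train)

in_domain = accept(np.array([0.84, 0.26, 0.16, 0.14]), mu, sigma)
ood_like  = accept(np.array([0.20, 0.90, 0.80, 0.70]), mu, sigma)
print(in_domain, ood_like)  # True False
```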

4 Experiments

The following experiments have been conducted on a collection of articles extracted from the Catalan newspaper AVUI, compiled during three periods of time in 2000, 2003 and 2004. This journalistic-style corpus is composed of 180 documents (3400 terms) divided into four thematic domains: politics (POL, 60 documents), music (MUS, 40 documents), theatre (TEA, 40 documents) and economy (ECO, 40 documents). Documents are represented in the VSM, weighted by their normalized tf×idf [1], and the document frequency threshold is experimentally fixed to 4.

The classifier is trained with 90% of the documents of the corpus. All experiments are conducted following a 10-fold cross-validation scheme, where the training and test sets are chosen randomly. Moreover, the classifier is obtained by applying the version of FastICA [14] that maximizes the skewness of the IC [4].

4.1 Comparison of Relevance Measures

This experiment compares the classic method for CSV_k computation (magnitude-based) to the proposed cdf-based technique. Table 1 presents the classification accuracy (A_c) [1] attained by both methods during the training and test phases. Both measures offer similar classification accuracies, but depending on the distribution of the IC values, misclassifications are differently scattered. The proposed cdf-based relevance measure offers a modest increase of the averaged classification accuracy during the test stage, while attaining similar results during the training phase.

Table 1. Classification accuracies (A_c) obtained by the ICA-based classifier, comparing the magnitude-based to the cdf-based CSV_k function during the training and test phases

    TRAINING
    Relevance measure   POL    MUS    TEA    ECO
    Magnitude-based     .787   1      1      .986
    cdf-based           .802   .951   1      .989

    TEST
    Relevance measure   POL    MUS    TEA    ECO
    Magnitude-based     .817   .900   1      .975
    cdf-based           .800   1      1      .975

4.2 Performance of the Classification Reliability Control

The following experiments evaluate the performance of the conditional categorization method proposed in section 3. In order to test the ability of the classifier to reject OOD queries, the test set described at the beginning of section 4 was enlarged by adding documents concerning society (SOC, 60 documents) and sports (SPO, 40 documents). Firstly, the optimal definition of the confidence regions CR_k is studied. Table 2 presents the microaveraged rejection accuracy² (A_r^µ) obtained by the proposed reliability control process for three definitions of CR_k. The best results are obtained when the upper and lower bounds of CR_k are set to µ(CR_k) + 2σ(CR_k) and µ(CR_k) - 2σ(CR_k), respectively.

² A_r^µ is defined as the ratio of the number of rejected OOD documents (true positives) plus the number of non-rejected classifiable documents (true negatives) to the total number of test documents.
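A sketch of this metric on hypothetical accept/reject decisions:

```python
def rejection_accuracy(rejected, is_ood):
    """Microaveraged rejection accuracy: rejected OOD documents (true
    positives) plus non-rejected classifiable documents (true negatives),
    over the total number of test documents."""
    correct = sum((r and o) or (not r and not o)
                  for r, o in zip(rejected, is_ood))
    return correct / len(rejected)

# Six test documents, the last two out-of-domain (hypothetical labels);
# one in-domain document is wrongly rejected.
rejected = [False, False, True, False, True, True]
is_ood   = [False, False, False, False, True, True]
acc = rejection_accuracy(rejected, is_ood)
print(round(acc, 3))  # 0.833
```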

Table 2. Microaveraged rejection accuracies obtained by the reliability control process for three confidence region definitions during the test process (µ and σ stand for µ(CR_k) and σ(CR_k), respectively)

    CR_k     A_r^µ
    µ ± σ    .463
    µ ± 2σ   .812
    µ ± 3σ   .645

Figure 1 illustrates the suitability of the optimal confidence region definition for rejecting OOD queries, as none of them (see the third and fourth rows of figure 1) is fully contained in the confidence regions of the training categories, in contrast to the rest of the queries (shown in the first and second rows of figure 1). The depicted data correspond to one arbitrarily chosen experiment. After performing the 10-fold cross-validation experiments, the rejection accuracies for each category (using the optimal definition of the confidence regions) are presented in table 3.

Table 3. Rejection accuracies (A_r) for each category using the optimal CR_k during the test process

    Category  A_r     Category  A_r     Category  A_r
    POL       .300    TEA       .075    SOC       .850
    MUS       .150    ECO       .100    SPO       .825

Note that the OOD queries are rejected at a notably high rate (see the fifth and sixth columns of table 3). On average, only 15% of the test documents corresponding to the training categories are rejected. Nevertheless, approximately 50% of these rejected documents would have ended up as misclassifications if no reliability control had been employed. Furthermore, only 4% of the non-rejected test documents are misclassified. As a conclusion, a more reliable ICA-based text classifier is obtained, without a significant loss of overall classification accuracy.

5 Conclusions

In this paper we have presented a first step towards reliable ICA-based text classification. In this context, a relative CSV function has been evaluated against the classic magnitude-based relevance measure. Although the proposed measure is insensitive to the IC distributions, little impact on the classification accuracy is observed. In addition, a method for controlling the reliability of the classification decisions has been presented. The experiments demonstrate good rejection accuracy for out-of-domain queries, without a significant loss of performance.

[Figure 1: per-category plots of the CSV_k components against the component index k; only the caption is reproduced here.]

Fig. 1. The top row shows the components of the relevance index vectors RI(d_j) of the training documents (from left to right: politics, music, theatre and economy). The second row presents the components of RI(d_j) of the test documents (thin line), from left to right: politics, music, theatre and economy, compared to the upper and lower bounds of the confidence regions of the corresponding training category (thick line). The third row presents the components of RI(d_j) of the society test documents (thin line) compared to the upper and lower bounds of the confidence regions of each training category (thick line). Finally, the bottom row presents the components of RI(d_j) of the sports test documents (thin line) compared to the upper and lower bounds of the confidence regions of each training category (thick line).

Further work will focus on evaluating the performance of the proposed relevance measure on other text corpora and on improving the reliability control process. Moreover, the presented reliability postprocessing can provide useful information in the context of hierarchical text classifiers (e.g. for choosing the most appropriate level of the hierarchy) and dynamic text classification systems (e.g. for discovering new categories).

References

[1] Sebastiani, F.: Machine Learning in Automated Text Categorisation. ACM Computing Surveys, Vol. 34, Nr. 1 (2002) 1-47
[2] Isbell, C.L. and Viola, P.: Restructuring Sparse High Dimensional Data for Effective Retrieval. Advances in Neural Information Processing Systems, 11 (1999) 480-486
[3] Kolenda, T., Hansen, L.K. and Sigurdsson, S.: Independent Components in Text. In: Girolami, M. (ed.): Advances in Independent Component Analysis. Springer-Verlag, Berlin Heidelberg New York (2000) 241-262
[4] Kabán, A. and Girolami, M.: Unsupervised Topic Separation and Keyword Identification in Document Collections: a Projection Approach. Dept. of Computing and Information Systems, University of Paisley. Technical Report Nr. 10 (2000)
[5] Kim, Y.-H. and Zhang, B.-T.: Document Indexing Using Independent Topic Extraction. Proc. of the 3rd International Conference on Independent Component Analysis and Blind Signal Separation (ICA'2001). San Diego, USA (2001) 557-562
[6] Kolenda, T.: Clustering Text Using Independent Component Analysis. Inst. of Informatics and Mathematical Modelling, Technical University of Denmark. Technical Report (2002)
[7] Kolenda, T. and Hansen, L.K.: Dynamical Components of Chat. Inst. of Informatics and Mathematical Modelling, Technical University of Denmark. Technical Report (2000)
[8] Bingham, E.: Topic Identification in Dynamical Text by Extracting Minimum Complexity Time Components. Proc. of the 3rd International Conference on Independent Component Analysis and Blind Signal Separation (ICA'2001). San Diego, USA (2001) 546-551
[9] Bressan, M. and Vitrià, J.: Improving Naive Bayes Using Class Conditional ICA. In: Garijo, F., Riquelme, J. and Toro, M. (eds.): Advances in Artificial Intelligence - Iberamia. Springer-Verlag (2002) LNAI 2527:1-10
[10] Takamura, H. and Matsumoto, Y.: Feature Space Restructuring for SVMs with Application to Text Categorization. Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP'2001). Pittsburgh, USA (2001) 51-57
[11] Alías, F., Iriondo, I. and Barnola, P.: Multi-domain Text Classification for Unit Selection Text-to-Speech Synthesis. Proc. of the 15th International Congress of Phonetic Sciences (ICPhS'2003). Barcelona, Spain (2003) 2341-2344
[12] Sevillano, X., Alías, F. and Socoró, J.C.: ICA-Based Hierarchical Text Classification for Multi-domain Text-to-Speech Synthesis. Proc. of the 29th International Conference on Acoustics, Speech and Signal Processing (ICASSP'2004). Montreal, Canada (2004) (to appear)
[13] Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K. and Harshman, R.: Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, Vol. 41, Nr. 6 (1990) 391-407
[14] Hyvärinen, A., Karhunen, J. and Oja, E.: Independent Component Analysis. John Wiley and Sons, New York (2001)
