Automating Personal Categorization Using Artificial Neural Networks

Dina Goren-Bar, Tsvi Kuflik, Dror Lev, and Peretz Shoval

Department of Information Systems Engineering, Ben Gurion University of the Negev, Beer-Sheva, Israel
(dinag,tsvikak,dlavie,shoval)@bgumail.bgu.ac.il

Abstract. Organizations as well as personal users invest a great deal of time in assigning documents they read or write to categories. Automatic document classification that matches users' subjective classification is widely used, but much challenging research still remains to be done. The self-organizing map (SOM) is an artificial neural network (ANN) that is mathematically characterized by transforming high-dimensional data into a two-dimensional representation. This enables automatic clustering of the input while preserving higher-order topology. A closely related method is the Learning Vector Quantization (LVQ) algorithm, which uses supervised learning to maximize correct data classification. This study evaluates and compares the application of SOM and LVQ to automatic document classification, based on a subjectively predefined set of clusters in a specific domain. A set of documents from an organization, manually clustered by a domain expert, was used in the experiment. Results show that in spite of the subjective nature of human categorization, automatic document clustering methods match subjective, personal clustering with considerable success, with the LVQ method being the more advantageous.

1 Introduction

1.1 Motivation

Information users define subjective topic trees or categories, based on personal preferences, and assign documents they read or write to categories according to this subjective definition. Now that the amount of publicly available information on the web is increasing rapidly, by roughly a million pages every day [3], automation of categorization and filtering of documents has become an essential procedure. For years, manual categorization methods were defined and employed in libraries and other document repositories based on human judgment. Organizations and users searched for information and saved it in categories in a way that was meaningful to them. Obviously, the categories used by information users may be idiosyncratic to the specific organization or user, but not to an automatic classification system. A major problem in generating an automatic, adaptable system is how closely the automatic system can reflect the subjective point of view of the user regarding his/her domain(s) of interest.


Users are usually inconsistent, having a tendency to change the document classification they use over time. Moreover, users may find a document relevant to more than one category, but usually choose just one, arbitrarily, to host the document. So how can an automatic clustering system "guess" the user's subjective classification? Moreover, we may see in organizations that there is no objective, defined topic tree that is supplied to or adopted by the employees; instead, each worker creates his or her unique categorization tree according to subjective criteria. To test this issue, taking a well-defined set of categorized documents agreed upon by several experts does not reflect the real problem. The motivation for the current study was to evaluate the actual use of automatic categorization techniques to support the manual categorization in a company where this task is performed on a daily basis by human experts. In our experiment, the system performed a task similar to that of one of the human categorizers, based on his data. In this study we implemented and evaluated two artificial neural network methods that address the specific user's preferences and deal with his/her varying subjective judgment. We used a set of manually categorized documents from a specific domain of interest (documents about companies and financial news), and trained SOM and LVQ ANNs to automatically categorize them. Then, we measured the distance between the automatically generated set of clusters (or categories) and the predefined manual clustering. Our assumption is that if the distances are marginal, the following method can be used for automatic clustering with such an ANN: a user provides a training set of pre-clustered documents; this set is used to train the system; after training is completed, the system provides an automatic clustering based on the user's preferences. The present study specifically addresses the following questions:

1. How close can automatic categorization come to personal, subjective user categorization?
2. What is the effect of the training set size on automatic categorization performance?
3. What is the difference between supervised (LVQ) and unsupervised (SOM) training with respect to the above questions?

1.2 Background

Clustering is defined as the unsupervised classification of patterns into groups [5]. A wide variety of clustering methods have been presented and applied in a variety of domains, such as image segmentation, object and character recognition, data mining, and information retrieval [5]. Artificial neural networks (ANN) have been used in recent years for modeling complex systems where no explicit equations are known, or where the equations are too idealized to represent the real world. An ANN can form predictive models from data available from past history. ANN training is done by learning from known examples. A network of simple mathematical "neurons" is connected by weights; training the ANN consists of adjusting the weights between the "neurons". Advanced algorithms can train ANN models with as many as thousands of inputs and outputs. By analyzing the trained ANN, useful knowledge can be extracted. Information retrieval and information filtering are among the various applications where ANNs have been successfully tested [2]. Two main branches of ANN are in use today, distinguished by their training methods, namely supervised and unsupervised:


1. The supervised ANN branch uses a "teacher" to train the model, where an error is defined between the model outputs and the known outputs. An error backpropagation algorithm adjusts the model connection weights to decrease the error by repeated presentation of inputs.
2. The unsupervised ANN branch tries to find clusters of similar inputs when no previous knowledge exists about the number of desired clusters.

In both cases, once the ANN is trained, it is verified by presenting inputs not used in the training. The self-organizing map (SOM), a specific kind of ANN, is a tool that may be used for automatic document categorization [4,6,7]. The SOM is an unsupervised competitive ANN that transforms high-dimensional data onto a two-dimensional grid, while preserving the data topology by mapping similar data items to the same cell on the grid (or to neighboring cells). A typical SOM is made of a vector of nodes for input, an array of nodes as an output map, and a matrix of connections between each output unit and all the input units. Thus, each vector of the input dimension can be mapped to a specific unit on a two-dimensional map. In our case each vector represents a document, while the output unit represents the category that the document is assigned to. Closely related to the SOM algorithm is the Learning Vector Quantization (LVQ) algorithm. LVQ is a supervised competitive ANN which also transforms high-dimensional data onto a two-dimensional grid, without taking data topology into account. To facilitate the two-dimensional transformation, LVQ uses cluster labels pre-assigned to the data items, minimizing the average expected misclassification probability. Unlike SOM, where clusters are generated automatically based on item similarities, here the clusters are predefined. In our case the cluster labels represent the subjective categorization of the various documents supplied by the user. LVQ training is somewhat similar to SOM training; an additional requirement of LVQ is that each output unit be given a cluster label prior to training [6].

In order to use a clustering mechanism such as an ANN-based approach, an appropriate document representation is required. The vector space model, in which a document is represented by a weighted vector of terms, is a very popular model in the information retrieval community [1]. This model suggests that a document may be represented by all the meaningful terms included in it. A weight assigned to a term represents the relative importance of that term. One common approach to term weighting is TF ("term frequency"), where each term is assigned a weight according to its frequency in the related document [8]. Sometimes document relations are also considered, combining TF weighting with IDF ("inverse document frequency") into TF*IDF weights. Using the SOM algorithm relieves this necessity, since it already takes document relations into account during the training phase. The TF method was therefore adopted in this study to define meaningful terms for document representation for both SOM and LVQ processing.

1.3 Related Work

A need for personalized categorization has existed for some time now, and a great deal of work has been done in various application areas. For example, various mail filing assistants have been developed, such as the MailCat system [9], which proposes folders for email that are similar to the categories or labeling process performed by the ANN.


MailCat employs the TF*IDF algorithm, where every folder has a centroid representative vector. The MailCat system provides the user with categories (folders) from which the user chooses the appropriate one. The authors mention that MailCat "does not even require that the classifier be based upon TF-IDF", suggesting it could be based on another representation. They rather highlight the importance of factors other than the representation method, such as reasonable accuracy, incremental learning, and producing a ranking of several possible categories, for personalized categorization systems. As can be understood from the SOM definition and from the results that follow, SOM can be used to provide more automation for systems like MailCat.

Clustering methods can identify groups of similar documents within huge collections. When this approach is applied to the results of a search engine, it may enhance and ease browsing and selecting the relevant information. Zamir & Etzioni [11] evaluated various clustering algorithms and showed a precision improvement over the initial search engine results, from 0.1-0.4 to 0.4-0.7. Rauber & Merkl [7] showed that clustering, applied to general collections of documents, can result in an ordered collection of documents, grouped into similar groups, that is easily browsable by users. Other studies resembling the "one button" (one category) implementation of MailCat reported roughly similar precision results for automatic categorization [9]. The main difference between MailCat and similar systems and the method we propose is that those systems are built to support and adapt towards specific user interests, based on the development of specific algorithms to achieve "reasonable accuracy", while in our case we use a well-known categorization method to represent subjective, personal behavior. Therefore, the major difference lies in the classifier used, and specifically in our demonstration of the feasibility of an unsupervised algorithm. This different approach may have interesting implications for the generalization of results to different domains, considering that in our study we reach the same improved levels of precision reported by Zamir and Etzioni (from 0.45 to 0.75) [11]. It is also conceivable that our system can be extended to more than two categories.

The rest of this paper is structured as follows: the next section describes the experiment performed for the evaluation of the training methods, followed by the presentation of the experimental results. We conclude with a discussion and planned future work.
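As a concrete illustration of the two training regimes contrasted in Section 1.2, the sketch below implements the core competitive update rules from scratch in Python/NumPy. It is an illustrative toy only, not the SOM Toolbox for Matlab actually used in the experiment described later; the grid size, learning-rate schedule, and function names are assumptions chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_som(docs, grid=(4, 4), epochs=20, lr0=0.5, sigma0=1.5):
    """Unsupervised SOM: pull the best-matching unit (and its grid
    neighbours) towards each input vector."""
    n_units = grid[0] * grid[1]
    # coordinates of the output units on the 2-D map
    coords = np.array([(i, j) for i in range(grid[0])
                              for j in range(grid[1])], dtype=float)
    w = rng.random((n_units, docs.shape[1]))            # random initial weights
    for t in range(epochs):
        lr = lr0 * (1 - t / epochs)                     # decaying learning rate
        sigma = sigma0 * (1 - t / epochs) + 0.1         # shrinking neighbourhood
        for x in docs:
            bmu = np.argmin(np.linalg.norm(w - x, axis=1))    # best-matching unit
            d = np.linalg.norm(coords - coords[bmu], axis=1)  # grid distance to BMU
            h = np.exp(-(d ** 2) / (2 * sigma ** 2))          # neighbourhood kernel
            w += lr * h[:, None] * (x - w)                    # move units towards x
    return w

def train_lvq1(docs, labels, codebook, codebook_labels, epochs=20, lr0=0.1):
    """Supervised LVQ1: move the winning codebook vector towards the input
    if its pre-assigned label matches, away from it otherwise."""
    w = codebook.copy()
    for t in range(epochs):
        lr = lr0 * (1 - t / epochs)
        for x, y in zip(docs, labels):
            win = np.argmin(np.linalg.norm(w - x, axis=1))
            sign = 1.0 if codebook_labels[win] == y else -1.0
            w[win] += sign * lr * (x - w[win])
    return w
```

After training, a document would be assigned to the category of its nearest output unit: in the SOM case the units still need to be labeled (e.g. by the majority vote described in Section 2.3), whereas the LVQ codebook vectors carry their labels from the start.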

2 Experiment

2.1 Data

The experiment was performed in a company which deals with information extraction and categorization, in order to evaluate the possible automation of data item categorization within the company's domain, using actual, available data. We chose this environment in order to deal with real-life data in a real-life organization. The data collection method resembled a field study. We did not ask the information expert to do an experimental task, which could have biased our experiment in the sense that he would deliberate over each data categorization. Instead, we took the items as they had already been clustered under natural daily conditions. A data set containing 1115 economics and financial data items was used.


The data was extracted from an online, Internet-based source (Yahoo.com). An information specialist read each item and manually clustered the items into one of 15 possible clusters. The size of the clusters varied from 1 to 125 documents per cluster. This approach represents a daily, normal operation of information seeking and categorization in organizations today.

Table 1. Original data clusters: the rows list the 15 categories and the columns the 10 randomly drawn test sets; each cell gives the number of documents of that category in the set, shown as LVQ\SOM where the two runs differed.

2.2 Method

All 1115 documents were analyzed using a classical text analysis methodology. "Stop words" were removed, and the resulting terms underwent further stemming, using the Porter stemming algorithm (Frakes and Baeza-Yates, 1992). Finally, normalized weighted term vectors were generated to represent the documents. In order to reduce the vector length, a low-frequency threshold was applied as a means of dimensionality reduction: non-informative terms which appeared in fewer than 100 documents (less than 10%) were excluded from the matrix. Then, the SOM and LVQ simulations were implemented using the SOM Toolbox for Matlab 5 [10]. Since experimental results may be influenced by the selection of the test and training sets, ten test sets were randomly selected. The size of each was 20% of the overall set (223 items). The size of 223 documents for the test set and the initial training set was selected in order to allow (when available) a sufficient number of examples from each original cluster to be used (an average of 15 documents per cluster). For each test set, several training sets of different sizes were generated from the remaining 892 items by random selection, so that the test set items were never part of the training set. Training set sizes comprised 20%, 40%, 60% and 80% of the original data set (223, 446, 669, and 892 items). The experiment was repeated ten times for each algorithm, with a randomly pooled test set, to overcome the possibility of biasing effects resulting from data set selection.
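A minimal sketch of the document-representation step just described (stop-word removal, Porter stemming, TF weighting, and the low-frequency cut-off) is shown below in Python rather than in the Matlab environment actually used; the tiny stop-word list, the 10% threshold parameter, and the function names are illustrative assumptions, not the authors' code.

```python
import re
from collections import Counter

import numpy as np
from nltk.stem import PorterStemmer   # Porter stemming, as referenced above

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}  # toy list
stemmer = PorterStemmer()

def tokenize(text):
    """Lower-case, drop stop words, and stem the remaining terms."""
    words = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(w) for w in words if w not in STOP_WORDS]

def build_tf_matrix(documents, min_doc_freq=0.10):
    """Normalized TF vectors; terms appearing in fewer than `min_doc_freq`
    of the documents are dropped (dimensionality reduction)."""
    token_lists = [tokenize(d) for d in documents]
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    threshold = min_doc_freq * len(documents)
    vocab = sorted(t for t, df in doc_freq.items() if df >= threshold)
    index = {t: i for i, t in enumerate(vocab)}

    tf = np.zeros((len(documents), len(vocab)))
    for row, tokens in enumerate(token_lists):
        for term, count in Counter(tokens).items():
            if term in index:
                tf[row, index[term]] = count
    norms = np.linalg.norm(tf, axis=1, keepdims=True)
    return tf / np.maximum(norms, 1e-12), vocab     # length-normalized vectors
```

The resulting matrix, one normalized TF row per document, is the kind of input that either the SOM or the LVQ training routine would consume.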


The labeling process of the auto-generated clusters consisted of the following steps: first, we observed the original manual classification of the documents clustered at each output unit (automatic cluster). The manual cluster with the highest frequency, indicating where most of the documents belong, gave its name to the automatic output unit. In order to evaluate the ANN performance, two popular measures were used: precision and recall. Recall, for a specific category, is the number of documents that were automatically clustered correctly by the system, divided by the overall number of documents that originally belong to that cluster (as given by the expert). Precision is the number of documents automatically clustered correctly, divided by the overall number of documents clustered by the system into the same category (correct + incorrect).

2.3 Set Up

Maps of different sizes were generated to match the different training sets. The initial weights of each map were generated randomly just before the learning phase. Two runs were performed, one for SOM and one for LVQ; hence, slight differences were found in the randomly selected test and training sets. Table 1 shows the document distribution among the ten sets, with the columns representing the test sets and the rows representing the categories. Wherever a difference appeared in a test set between the SOM and the LVQ runs, the number on the left-hand side belongs to the LVQ test set while the number on the right-hand side is from the SOM test set.
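The majority-vote labeling rule and the two evaluation measures defined above can be written down directly. The sketch below assumes each test document has already been mapped to an output unit (e.g. by nearest-unit lookup); the dictionary-based data layout and variable names are illustrative assumptions.

```python
from collections import Counter, defaultdict

def label_units(unit_of_doc, manual_label_of_doc):
    """Give each output unit the most frequent manual category among the
    documents mapped to it (majority vote), as described above."""
    labels_per_unit = defaultdict(list)
    for doc, unit in unit_of_doc.items():
        labels_per_unit[unit].append(manual_label_of_doc[doc])
    return {unit: Counter(labels).most_common(1)[0][0]
            for unit, labels in labels_per_unit.items()}

def recall_precision(predicted, manual, category):
    """Recall: correctly clustered docs of `category` / all docs the expert
    put in `category`.  Precision: correctly clustered docs of `category` /
    all docs the system put in `category`."""
    correct = sum(1 for d in manual
                  if manual[d] == category and predicted.get(d) == category)
    in_category = sum(1 for d in manual if manual[d] == category)
    assigned = sum(1 for d in predicted if predicted[d] == category)
    recall = correct / in_category if in_category else 0.0
    precision = correct / assigned if assigned else 0.0
    return recall, precision
```

Averaging these per-category values over the ten randomly drawn test sets yields figures comparable to those reported in Tables 2-5.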

3 Experimental Results

In order to evaluate the performance of the automatic classifications with each method, we computed the average precision and recall on the ten different test-training sets.

3.1 LVQ Results

The results of LVQ-based recall and precision are summarized in Table 2. The rows represent each of the initial 15 manually defined "original" categories, while the columns represent the four different sizes of training sets. Precision and recall results are denoted "P" and "R" respectively. The results for each training set are the average of the ten runs. The nature of the data used (variation in category sizes) resulted in several "empty" categories in the training/test sets (see Table 1 for a summary of the actual number of documents for each category in every test set). The results show that recall improves significantly as the size of the training set increases to 60% (669 documents). For the 80% training set size the results are mixed, which may indicate that at 60% enough data is accumulated to learn and adapt the ANN to the user's subjective preferences and no more data is needed. There is a noticeable difference between categories 1-7 and the other categories, which may result from the small number of documents in these categories. This difference implies that learning is not possible in categories where a training set is not available (or is of marginal size).

Table 2. LVQ

Category   80% (892)      60% (669)      40% (446)      20% (223)
           R     P        R     P        R     P        R     P
1*         0     0        0     0        0     0        0     0
2**        0     0        0     0        0     0        0     0
3*         0     0        0     0        0     0        0     0
4          0     0        0     0        0     0        0     0
5          0     0        0     0        0     0        0     0
6          0.46  0.97     0.26  1        0.20  1        0.04  1
7          0.77  0.59     0.80  0.58     0.63  0.57     0.30  0.44
8          0.95  0.81     0.94  0.74     0.93  0.71     0.81  0.69
9          0.77  0.88     0.72  0.84     0.71  0.88     0.69  0.77
10         0.52  0.59     0.60  0.59     0.60  0.57     0.59  0.53
11         0.70  0.51     0.62  0.53     0.60  0.46     0.51  0.44
12         0.78  0.68     0.74  0.65     0.73  0.64     0.68  0.59
13         0.69  0.81     0.69  0.83     0.65  0.82     0.65  0.76
14         0.81  0.84     0.81  0.82     0.80  0.83     0.80  0.82
15         0.79  0.84     0.82  0.83     0.79  0.78     0.75  0.74

* All marked categories contained a minimal number of documents and appeared as specific categories only in part of the 10 randomly generated sets – see Table 1 for details.

Table 3. SOM

Category   80% (892)      60% (669)      40% (446)      20% (223)
           R     P        R     P        R     P        R     P
1*         0     –        0     –        0     –        0     –
2**        –     –        –     –        –     –        –     –
3*         –     –        –     –        –     –        –     –
4          0.50  0.53     0.50  0.61     0.20  0.42     0.10  0.25
5          –     –        –     –        –     –        –     –
6          0.39  0.71     0.22  0.77     0.23  0.56     0.17  0.25
7          0.47  0.45     0.35  0.51     0.35  0.49     0.40  0.45
8          0.92  0.74     0.88  0.69     0.79  0.67     0.73  0.68
9          0.68  0.87     0.67  0.86     0.68  0.86     0.52  0.83
10         0.78  0.56     0.79  0.53     0.75  0.52     0.72  0.45
11         0.53  0.61     0.50  0.58     0.47  0.61     0.41  0.50
12         0.70  0.66     0.69  0.57     0.64  0.53     0.60  0.50
13         0.66  0.80     0.66  0.78     0.62  0.77     0.57  0.66
14         0.79  0.85     0.76  0.90     0.76  0.91     0.76  0.85
15         0.79  0.72     0.74  0.71     0.70  0.71     0.67  0.69

* Marked categories appeared as categories only in part of the 10 randomly generated sets; the values recovered for category 3 (0, 0.33, 0, 0) and category 5 (0, 0.15, 0.5, 0, 0, 0) could not be assigned to specific training-set sizes.
** No such category at all (was not defined by SOM during the learning sessions).


It is also noticeable that precision improves as the size of the training set grows, but in this instance 80% training yielded the best results, with the exception of group 6, which contained a minimal number of documents, group 11, which is a collection of un-categorized documents, and group 13. These results imply that when there are enough examples, the LVQ method supports personal categorization to a reasonable degree.

3.2 SOM Results

The results of SOM-based recall are summarized in Table 3, where each "original" category is represented by a row, while columns represent the four different sizes of training sets. The results for each training set are the average of the ten runs. The nature of the data used (variation in category size) resulted, like the LVQ results, in several "empty" categories in the training/test sets. Table 1 summarizes the actual number of documents per category for each test set. Results show that recall improves significantly as the size of the training set grows, with the exception of groups 1-5, which had nearly no documents. For the 80% training set size some of the results approximate the results obtained at 60%, which may mean that at 60% enough data has been accumulated and the ANN has succeeded in learning the user's specific preferences, so no more data is needed. Category 2 was not represented at all by the SOM, for some unknown reason. Category 11 included all documents that could not be assigned (by the human expert) to any category, and hence constitutes a special category of items that were not well defined. Results show that precision also improves as the size of the training set grows, but this time the 80% training yielded the best results, with the exception of groups 1-5, which had nearly no documents, groups 6 and 7, which contained a minimal number of documents, and group 14. These results show that, given enough examples, the SOM method (which is an unsupervised automatic system) can match personal categorization to a reasonable degree.

3.3 Overall Results and a Comparison of SOM and LVQ Methods

Performing automatic clustering by SOM and LVQ gave the following results (an overall average, summarized in Table 4): recall of LVQ ranged from 49% to 60%, while recall of SOM ranged from 45% to 57%; precision of LVQ ranged from 65% to 75%, while precision of SOM ranged from 55% to 69%. Given enough training documents (a learning set of 892 documents), both methods yielded similar results of about 70% precision and about 60% recall. As expected, in the case of the 223-document training sets, supervised learning yielded better initial results: 49% recall using LVQ vs. 45% recall using SOM, and 65% precision using LVQ, but only 55% precision using SOM.

Table 4. Overall results

Measure     Method   80% (892)   60% (669)   40% (446)   20% (223)
Recall      LVQ      0.60        0.58        0.55        0.49
Recall      SOM      0.57        0.54        0.50        0.45
Precision   LVQ      0.75        0.73        0.71        0.65
Precision   SOM      0.69        0.63        0.60        0.55

The experimental results become problematic when examining the original categories that contained a minimal number of documents, or no documents at all, in some of the test sets, as shown in Table 1. Table 5 summarizes the results for those categories with more than 10 documents per category (categories 8-15). It can be seen that recall improves significantly, while the precision of SOM improves and that of LVQ stays the same. This result supports the assumption that categories with only a few documents actually lower the overall level of precision and recall, since the system could not be trained to cluster them.

Table 5. Overall results – "full" categories

Measure     Method   80% (892)   60% (669)   40% (446)   20% (223)
Recall      LVQ      0.75        0.74        0.73        0.69
Recall      SOM      0.73        0.71        0.68        0.62
Precision   LVQ      0.75        0.73        0.71        0.67
Precision   SOM      0.73        0.70        0.70        0.65

4 Conclusions and Future Work

The purpose of this work was to test the possibility of automating the classification of subjectively categorized data sets. For that purpose we chose to work with real data from a company which deals with information search and categorization on a daily basis. Our results show that, of the initial 10 test sets of data (Table 1), categories 1-3 get eliminated in the process because of their sparseness. Category 4 performed slightly better. Categories 5-7 are somewhat more problematic, probably because the small number of documents is insufficient for training and therefore results in incorrect clustering. In spite of these problematic data sets (which represent real life), the overall results confirm the hypothesis that it is possible to cluster, with reasonable error, a subjective classification. Performing automatic clustering using either SOM or LVQ provided an overall match of 73% to 75% recall and 73% to 75% precision for the two methods; this degree of match was achieved with a learning set of 892 documents. This level of precision and recall outperforms the results reported by Segal & Kephart [9]. Another not unexpected finding is that supervised learning yields better results initially (69% vs. 62% recall and 67% vs. 65% precision for the 223-document training set).


The surprising issue, though, was that the overall performance is quite similar given enough documents. A major question for the adoption of automatic subjective classification systems is the size of the training set required. If the training set needed is large, it is most likely that users will not use such a system. In our case, the size of the training set varied from ~200 to ~900 data items, an average of 15 to 60 data items per category. While 15 data items may be a reasonable size for a data set, bigger sizes may be problematic. However, even the initial small training set yields results that are quite close to the overall performance achieved by the larger set. This implies that an initial, small set (with some ongoing adaptation) may be sufficient for a start. In conclusion, despite the subjective nature of human categorization, an automatic process can perform that task with considerable success. Giving an automatic classifier just the number of clusters, as determined by the user, and a training set enables the system to "learn" the human classification. The finding that the system classifies items in a way similar to humans is even more interesting in the unsupervised learning case. Usually there exist several possible ways to classify the same set of data items, but the results show that even in this case the machine and the user were quite close. This result supports the rationality of the human classifier, meaning that human and machine referred to similar dimensions in the performance of the task. These first results require further study. For one thing, we will try to define the various types of errors occurring during the classification process, especially by having the human expert examine the automatic classification errors, in order to validate the initial subjective classification. In the present study we compared the automatic classification to the manual one, assuming that the manual classification is 100% correct. But people are inconsistent and change their minds frequently. It would be interesting to give the user the classification, with the mismatches, and let him/her decide who was right. We also intend to compare the classifiers already implemented with new ones. This will enable us to assess our overall level of performance, comparing our results with other automated methods. Finally, it might be worthwhile to investigate the correlation between the size of the initial cluster and the learning and classification process (the influence of the "sparse" clusters on the overall results).

References

1. Baeza-Yates and Ribeiro-Neto (1999). Modern Information Retrieval. Addison-Wesley, 1999.
2. Boger, Z., Kuflik, T., Shapira, B. and Shoval, P. (2000). Information Filtering and Automatic Keywords Identification by Artificial Neural Networks. Proceedings of the 8th European Conference on Information Systems, pp. 46-52, Vienna, July 2000.
3. Chakrabarti, S., Dom, B. et al. (1999). Hypersearching the Web. Scientific American, June, 54-60.
4. Honkela, T., Kaski, S., Lagus, K., and Kohonen, T. (1997). WEBSOM - Self-Organizing Maps of Document Collections. In Proceedings of WSOM'97, Workshop on Self-Organizing Maps, Espoo, Finland, June 4-6, pp. 310-315. Helsinki University of Technology, Neural Networks Research Centre, Espoo, Finland.
5. Jain, A. K., Murty, M. N., Flynn, P. J. (1999). Data Clustering: A Review. ACM Computing Surveys, Vol. 31, No. 3, pp. 264-323, September 1999.


6. Kohonen, T. (1997). Self-Organizing Maps. 2nd ed., Springer-Verlag, Berlin.
7. Rauber, A. and Merkl, D. (1999). Using self-organizing maps to organize document archives and to characterize subject matters: How to make a map tell the news of the world. Proceedings of the 10th Intl. Conf. on Database and Expert Systems Applications (DEXA'99), Florence, Italy.
8. Salton, G., McGill, M. (1983). Introduction to Modern Information Retrieval. McGraw-Hill, New York.
9. Segal, R. B., and Kephart, J. O. (1999). MailCat: An Intelligent Assistant for Organizing E-Mail. Proceedings of the Third International Conference on Autonomous Agents, pp. 276-282.
10. Vesanto, J., Alhoniemi, E., Himberg, J., Kiviluoto, K., & Parviainen, J. (1999). Self-Organizing Map for Data Mining in Matlab: The SOM Toolbox. Simulation News Europe, (25):54.
11. Zamir, O., and Etzioni, O. (1998). Web Document Clustering: A Feasibility Demonstration. Proceedings of SIGIR '98, Melbourne, Australia.
