Journal of Information Science OnlineFirst, published on December 3, 2007 as doi:10.1177/0165551507082592
A comparative study of two automatic document classification methods in a library setting
Joanna Yi-Hang Pong Run Run Shaw Library, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong
Ron Chi-Wai Kwok, Raymond Yiu-Keung Lau, Jin-Xing Hao and Percy Ching-Chi Wong Department of Information Systems, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong
Abstract

In current library practice, trained human experts usually carry out document cataloguing and indexing based on a manual approach. With the explosive growth in the number of electronic documents available on the Internet and in digital libraries, it is increasingly difficult for library practitioners to categorize both electronic documents and traditional library materials using just a manual approach. To improve the effectiveness and efficiency of document categorization in the library setting, more in-depth studies of using automatic document classification methods to categorize library items are required. Machine learning research has advanced rapidly in recent years. However, applying machine learning techniques to improve library practice is still a relatively unexplored area. This paper illustrates the design and development of a machine learning based automatic document classification system to alleviate the manual categorization problem encountered within the library setting. Two supervised machine learning algorithms have been tested. Our empirical tests show that supervised machine learning algorithms in general, and the k-nearest neighbours (KNN) algorithm in particular, can be used to develop an effective document classification system to enhance current library practice. Moreover, some concrete recommendations regarding how to practically apply the KNN algorithm to develop automatic document classification in a library setting are made. To the best of our knowledge, this is the first in-depth study of applying the KNN algorithm to automatic document classification based on the widely used LCC classification scheme adopted by many large libraries.
Keywords: automatic document classification; text categorization; machine learning; k-nearest neighbours classifier; naïve Bayes classifier; library practice
Correspondence to: Ron Kwok, Department of Information Systems, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong. Email: [email protected]
Journal of Information Science, XX (X) 2007, pp. 1–18 © CILIP, DOI: 10.1177/0165551507082592
Copyright 2007 by Chartered Institute of Library and Information Professionals.
1. Introduction
Document classification (e.g. categorizing books and periodicals according to some pre-defined hierarchical structures) has long been one of the important activities of current library practice [1]. Traditionally, trained human experts (e.g. librarians) take full responsibility for cataloguing and indexing [2]. However, with the explosive growth of the number of electronic documents (e.g. e-books or Web pages) for general reference or specific research purposes, it is quite difficult, if not totally impossible, for library and information professionals to manually categorize and index documents. This is the problem of so-called information overload [3]. In fact, it is very time-consuming to classify a rich mixture of electronic documents and printed library items solely based on manual methods [4].

To alleviate this problem and further improve current library practice, some initial ideas for applying automatic document classification methods to the categorization of electronic documents in an experimental setting have been explored [5, 6]. According to Sebastiani [7], automatic text classification (categorization) is the task of building software tools to automatically assign labels (from a set of pre-defined class labels) to a document based on selected features of that document. Until the late 1980s, the most popular automatic text categorization method was based on the knowledge engineering approach, where a set of manually defined rules is applied to classify documents. However, the main problem of such an approach is the knowledge acquisition bottleneck: domain experts must be available and heavily consulted in designing the classification rules. In fact, it is very time-consuming to elicit document classification knowledge even if domain experts are abundantly available, which is unlikely in the real world. In recent years, machine learning techniques have been applied to develop automated document classification systems [7–11]. The advantage of applying machine learning approaches to automated document classification is that classification knowledge can be induced automatically from a set of training documents.

There are two main processes of automatic document classification, namely document indexing and classifier training/learning. Document indexing refers to the process of mapping a document to its compact representation (e.g. a collection of words) which can be directly processed by a classifier building algorithm [12]. One common method of document representation (indexing) is based on vectors of term weights (the vector space model) [13, 14]. A term can refer to a word, a stem, or a phrase, depending on the particular document indexing scheme [7]. There are various ways of computing term weights; one of the basic approaches is to weight a term based on the frequency of its occurrence in the document [13]. Classifier training can be conducted based on an inductive process which involves a machine learning algorithm and a collection of pre-labelled training documents. Two well-known supervised machine learning algorithms for automatic document classification are the k-nearest neighbours (KNN) algorithm and the naïve Bayes (NB) algorithm [6, 10, 15–17]. The KNN classifier [6, 10] is based on an instance-based learning/classification approach, whereas the NB classifier [6, 10] is based on probability theory.
Although some machine learning methods have been studied for automated document classification, most of these methods have only been evaluated in experimental settings which involve artificial document collections rather than real library collections. In fact, there are very few operational document classification systems which are based on machine learning algorithms [18]. Where machine learning techniques are applied to document classification in the real world, the effectiveness of these operational systems is far below that demonstrated in experimental settings [19]. Thus, machine learning based document classification methods are still at the infancy stage in the library setting [5]. There are very few papers which discuss the use of machine learning algorithms in general, and the KNN and NB algorithms in particular, to construct automatic document classification systems in the library setting.

Over the past two decades, standards and procedures for classifying library materials have been well developed [2]. Comparatively speaking, the networked computing environment and the methods for electronic document description and organization are still evolving [20]. In the literature of library and information science, the need to combine electronic documents with traditional library materials has inspired continuous discussions on the refinement of existing manual classification schemes such as the Library of Congress Classification (LCC), Library of Congress Subject Headings
(LCSH), Dewey Decimal Classification (DDC), and Universal Decimal Classification (UDC) [20–22]. Existing document classification schemes are well established and lead to effective manual document classification. Accordingly, these schemes have already been adopted as the standard within the library community. It is believed that conventional classification schemes such as the DDC may be refined to classify electronic documents [6]. On the other hand, the broad classification schemes adopted by Web information providers (e.g. the Yahoo catalogue) lack rigorous hierarchical structures and clear conceptual organizations. As most machine learning based document classification methods are only evaluated on ad hoc electronic document classification schemes, it is crucial to evaluate the effectiveness of these methods based on well-known document classification schemes already adopted by the library community.

Only a few automated document classification systems in the literature are constructed based on standard document classification schemes. The GERHARD project used the UDC scheme, while the Scorpion project employed the DDC scheme [23]. An experimental automatic document classification system was also built using the LCC scheme [22]. However, none of these document classification systems was constructed based on machine learning techniques; the focus of the aforementioned projects was the construction of the class thesauri, rather than examining effective document classification algorithms. Recently, a prototype automated electronic document classification system has been developed by some universities in Korea [6]; a KNN classifier and the DDC document classification scheme were used to develop the prototype system. To extend Chung and Noh's work [6], we examine both the k-nearest neighbours classifier and the naïve Bayes classifier for the development of an automated document classification system which can classify a mixture of electronic documents and traditional library materials according to the widely used LCC classification scheme.

The main contribution of this paper is the illustration of how to apply supervised machine learning techniques to develop an operational automatic document classification system to enhance existing library practice. In particular, the specific contributions of our research work are as follows:

• examining the relative merits of two widely used supervised machine learning methods for document classification in a library setting;
• developing a machine learning based automatic document classification system which can categorize both electronic documents and traditional library materials according to a standard library classification scheme;
• evaluating the performance of the automatic document classification systems based on real library materials and electronic documents retrieved from the World Wide Web;
• developing a refined document classification scheme which is suitable for categorizing a mixture of electronic documents and traditional printed materials in typical library settings;
• proposing a set of fine-tuned system parameters for applying the KNN algorithm in the construction of an operational document classification system in the library setting.

The remainder of this paper is organized as follows.
In Section 2, the details of two supervised machine learning algorithms for automatic document classification are presented, and in Section 3 we outline the general system architecture of the automatic document classification system for enhancing current library practice. The general approach to the evaluation of our automatic document classification system is illustrated in Section 4. A discussion of our experimental results and recommendations on how to apply the KNN algorithm to develop an operational document classification system are given in Section 5. Finally, we offer concluding remarks and describe future directions of this research work.
2. Supervised machine learning algorithms
We focus on two supervised machine learning methods in this paper because they have been widely used for previous text categorization tasks [7]. One of them is the k-nearest neighbours classifier [6, 8, 10], and the other is the naïve Bayes classifier [8, 15, 16, 24].
2.1. The KNN algorithm
The KNN method is an instance-based learning approach, as the classifier makes use of existing instances (e.g. training documents) to determine the class label of a new instance. The KNN method is also considered to be a lazy learner, because prototypical class descriptions are not induced during the training stage, whereas other classifiers, such as the naïve Bayes classifier or a decision tree algorithm, generalize prototypical class models during the training stage. The KNN classifier works in this way: a new instance (e.g. a document) is classified by comparing it with a set of training instances according to a pre-defined distance metric; the class labels of the k training instances which are closest to the new instance (i.e. the k nearest neighbours) are then used to determine the class label of the new instance. For a basic KNN classifier, the new instance is simply assigned to the majority class of the k nearest neighbours. A more sophisticated KNN algorithm will also consider the degree of similarity between the new instance and an existing training instance when the class membership of the new instance is computed.

2.1.1 Document representation
Before the similarities between a new instance and the existing instances are evaluated by the KNN classifier, all the instances must be characterized by a computer-based representation scheme such that they can be manipulated by the classifier. The well-known vector space model [14] employed in the field of information retrieval (IR) is used for our document indexing. For example, common English words such as 'a', 'and', 'the', etc. are removed from each document according to a stop word list [14]; since these common words cannot effectively represent the semantic content of documents, they should be filtered out before a classifier is invoked. Moreover, stemming is applied to reduce the variety of word forms to a single canonical form; the Porter stemmer is employed for word stemming [25]. To select useful terms to represent a document, the term frequency inverse document frequency (TFIDF) weighting scheme [14] is applied:

$$w_t = \frac{\frac{tf_t}{\max tf} \cdot \log_2 \frac{N}{N_t}}{\sqrt{\sum_{k \in d} \left( \frac{tf_k}{\max tf} \cdot \log_2 \frac{N}{N_k} \right)^2}}$$
where tf_t is the occurrence frequency of term t in a document, and N_t and N represent the number of documents containing term t and the total number of documents in the collection respectively. Each document is ultimately represented by a vector of TFIDF weights, with each vector position corresponding to a particular word stem (term). The set of documents D = TR ∪ TE is divided into a training set TR and a test set TE.

2.1.2 Training stage
Strictly speaking, the KNN method does not involve a genuine training phase. As a lazy learner, it defers the generalization of the prototypical class descriptions until the testing stage. During the training stage, the set of training documents TR is simply collected and indexed. The resulting TFIDF vectors will be used to classify new documents at the testing stage.
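To make the indexing step concrete, the following is a minimal sketch of TFIDF indexing for a training collection. It is written in Python for brevity (the WADCS prototype itself was implemented in Java), and the tokenizer, stop word list, and helper names (tokenize, tfidf_vector) are illustrative stand-ins rather than the system's actual code; Porter stemming is omitted.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "for", "is"}  # tiny illustrative list

def tokenize(text):
    # lower-case, keep alphabetic tokens, drop stop words (no stemming here)
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def tfidf_vector(tokens, doc_freq, n_docs):
    """Weight each term by (tf / max_tf) * log2(N / N_t), then length-normalize."""
    tf = Counter(tokens)
    max_tf = max(tf.values())
    weights = {t: (f / max_tf) * math.log2(n_docs / doc_freq[t])
               for t, f in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else weights

# index a toy training collection TR
train_docs = ["library classification of electronic documents",
              "machine learning for text classification",
              "the economics of library collections"]
train_tokens = [tokenize(d) for d in train_docs]
doc_freq = Counter(t for toks in train_tokens for t in set(toks))
train_vectors = [tfidf_vector(toks, doc_freq, len(train_docs)) for toks in train_tokens]
```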
2.1.3 Testing stage
For the KNN classifier, most of the computational work is performed during the testing stage. When a testing document is presented to the classifier, the corresponding TFIDF vector of the testing document is compared with all the TFIDF vectors of the training set; this is a very computationally expensive process if the size of the training set is large. The cosine similarity measure [14] is used to compute the similarities between the testing document vector and each of the training document vectors. The resulting similarity scores are then sorted in descending order of magnitude, and the top k training documents (those with the highest similarity scores) are further examined to determine the class membership of the testing document. In particular, the class labels of these k nearest neighbours are used to decide the class label of the testing document. Essentially, two factors are taken into account when the class label of the testing document is decided: firstly, the dominant class of the k nearest neighbours, and secondly the similarity values between the testing document and the k nearest neighbours. The set of class labels of the k nearest neighbours can then be ranked according to a weighted combination of these two factors. For single class label assignment, the class label which is ranked the highest is assigned to the testing document [7]. For multiple class label assignment, the top n class labels will be assigned to the testing document. The cosine similarity score employed in our KNN algorithm is defined as follows [8, 10, 11]:

$$CosSim(d_x, d_y) = \frac{\sum_{i=1}^{|T|} w_{xi} \times w_{yi}}{\sqrt{\sum_{i=1}^{|T|} (w_{xi})^2} \times \sqrt{\sum_{i=1}^{|T|} (w_{yi})^2}}$$

where w_xi represents the TFIDF weight of the ith term in the testing document dx ∈ TE, and w_yi is the TFIDF weight of the ith term in the training document dy ∈ TR respectively. The set T = {t1, t2, …, t_|T|} represents the set of terms (i.e. the vocabulary) of the document collection D.
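A direct transcription of this similarity measure, assuming the sparse dictionary vectors produced by the indexing sketch above:

```python
import math

def cosine_similarity(vec_x, vec_y):
    """Cosine similarity between two sparse term-weight vectors held as dicts."""
    dot = sum(w * vec_y[t] for t, w in vec_x.items() if t in vec_y)
    norm_x = math.sqrt(sum(w * w for w in vec_x.values()))
    norm_y = math.sqrt(sum(w * w for w in vec_y.values()))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0
```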
The KNN algorithm applied at the testing stage is defined as follows:

1. compute the TFIDF vector of a testing document dx
2. for each document dy ∈ TR
3.   let sxy = CosSim(dx, dy)
4. next
5. sort the training documents in descending order of sxy
6. let Dk be the top k training documents sorted by descending order of sxy (i.e. the k nearest neighbours)
7. let class(dx) = argmax_j Σ_{z=1}^{|Dk|} CosSim(dx, dz) × member(dz, cj), where member(dz, cj) ∈ {0, 1} indicates whether dz is a member of the class cj (member(dz, cj) = 1) or not (member(dz, cj) = 0)
8. return class(dx)
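A compact rendering of steps 1–8 as an illustrative Python sketch rather than the WADCS implementation; it reuses the hypothetical cosine_similarity helper sketched above, and knn_classify is likewise an invented name:

```python
from collections import defaultdict

def knn_classify(test_vec, training_set, k=3, n_labels=1):
    """training_set: list of (tfidf_vector, class_label) pairs.
    Returns the n_labels class labels with the highest similarity-weighted votes."""
    # similarity of the testing document to every training document (steps 1-5)
    scored = sorted(((cosine_similarity(test_vec, vec), label)
                     for vec, label in training_set), reverse=True)
    votes = defaultdict(float)
    for sim, label in scored[:k]:   # the k nearest neighbours (step 6)
        votes[label] += sim         # similarity-weighted vote (step 7)
    ranked = sorted(votes, key=votes.get, reverse=True)
    return ranked[:n_labels]
```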
2.2. The NB algorithm
The NB classifier is based on probability theory. It first induces prototypical class models based on a set of training objects; the generalized class models are expressed in terms of probability distributions. When a testing object is presented, the NB classifier tries to estimate the class membership of the testing object by computing the appropriate probability. One main assumption behind the NB classifier is that the occurrences of the features ti ∈ T characterizing a document dx, when viewed as random variables, are statistically independent of each other. This conditional independence assumption is expressed as follows:

$$P(d_x \mid c_j) = P(t_1, \ldots, t_{|T|} \mid c_j) = \prod_{i=1}^{|T|} P(t_i \mid c_j)$$
where T = {t1, t2, …, t⏐T⏐} is the set of features (terms) characterizing the objects (documents) in a collection D. Given a class cj, the probability of this class containing the object dx is derived from the product of the individual conditional probabilities (the conditional probability that a class is described by a feature characterizing the object).
2.2.1 Document representation
Similar to the KNN classifier, documents need to be represented in a computable form before they are passed to the NB classifier. Standard stop word removal and stemming are applied to each document dx ∈ D in the collection. However, a sophisticated term weighting scheme is normally not applied when the NB classifier is used [7, 11, 24, 26], and a binary vector is used to represent a document. If a term ti ∈ T appears in a document, a '1' will appear in the corresponding position of the binary document vector; otherwise the value is '0'. In other words, a binary term weighting function

$$BTW(t_i, d_x) = \begin{cases} 1 & \text{if } t_i \in d_x \\ 0 & \text{if } t_i \notin d_x \end{cases}$$
is applied to evaluate the weight of term ti for document dx.

2.2.2 Training stage
The training stage is crucial for a NB classifier since the decision making model for document classification is induced at this stage. In particular, the initial probabilities such as P(cj) and P(ti | cj) are estimated based on a set of training documents TR, where cj ∈ C is one of the set of pre-defined categories C. More specifically, the probability P(cj) is estimated using

$$P(c_j) = \frac{|\{d_x \in TR \mid member(d_x, c_j) = true\}|}{|TR|}$$

and the conditional probability P(ti | cj) is estimated using

$$P(t_i \mid c_j) = \frac{|\{d_x \in TR_j \mid BTW(t_i, d_x) = 1\}|}{|TR_j|}$$

where TRj = {dx ∈ TR | member(dx, cj) = true} is the set of training documents with the class label cj. Since a term ti ∈ T from the set of all possible terms T of a given collection D may not appear in the training set TR, a smoothing procedure should be applied to estimate the conditional probabilities of these rare terms; otherwise the conditional probability of a testing document which contains a rare term will always equal zero. In other words, the NB classifier is not able to classify such a document. To alleviate the zero conditional probability problem, the Laplace smoothing method with m-estimate [26] is adopted by our NB classifier:

$$P(t_i \mid c_j) = \frac{|TR_{ij}| + mp}{|TR_j| + m}$$

where p = 1/|T| and m = |T|. In addition, TRij is defined by TRij = {dx ∈ TRj | BTW(ti, dx) = 1}. Accordingly, our NB algorithm is defined as follows:
1. let T = {t1, t2, …, t_|T|} be the vocabulary (the set of terms) of the document collection D
2. for each category cj ∈ C
3.   let TRj = {dx ∈ TR | member(dx, cj) = true}
4.   compute P(cj) = |TRj| / |TR|
5.   for each term ti ∈ T
6.     let TRij = {dx ∈ TRj | BTW(ti, dx) = 1}
7.     let P(ti | cj) = (|TRij| + 1) / (|TRj| + |T|)
8.   next
9. next
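The training stage can be sketched as follows; the binary document representation is modelled as a set of terms, and train_naive_bayes is an illustrative helper name, not the system's API:

```python
from collections import defaultdict

def train_naive_bayes(training_set, vocabulary):
    """training_set: list of (term_set, class_label) pairs, where term_set holds
    the vocabulary terms occurring in the document (binary representation).
    Returns the class priors P(c) and the smoothed conditionals P(t | c)."""
    docs_per_class = defaultdict(list)
    for terms, label in training_set:
        docs_per_class[label].append(terms)

    priors, conditionals = {}, {}
    for label, docs in docs_per_class.items():
        priors[label] = len(docs) / len(training_set)
        # Laplace smoothing with m-estimate: m = |T| and p = 1/|T|, i.e. add-one
        conditionals[label] = {
            t: (sum(1 for d in docs if t in d) + 1) / (len(docs) + len(vocabulary))
            for t in vocabulary
        }
    return priors, conditionals
```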
2.2.3 Testing stage
The testing stage of the NB classifier is relatively simple. The objective is to compute the conditional probability P(cj | dx), which indicates the probability of an arbitrary document dx belonging to the class cj. The conditional probability can be derived from the initial probabilities according to Bayes's theorem [7]:

$$P(c_j \mid d_x) = \frac{P(d_x \mid c_j) \times P(c_j)}{P(d_x)} = \frac{P(d_x \mid c_j) \times P(c_j)}{\sum_{j=1}^{|C|} P(d_x \mid c_j) \times P(c_j)}$$
Since the prior probabilities P(cj) and the conditional probabilities P(dx | cj) = ∏_{i=1}^{|T|} P(ti | cj) are derived for each class cj ∈ C and for each term ti ∈ T during the training stage, the document conditional probability P(cj | dx) can be computed in a straightforward manner. For each testing document dx, the set of terms characterizing the document (i.e. Tx = {ti ∈ T | BTW(ti, dx) = 1}) is identified, to retrieve the corresponding conditional probabilities learnt during the training stage. After computing the document's conditional probability P(cj | dx) for each cj ∈ C, the category with the highest probability is assigned to the testing document. The NB algorithm employed at the testing stage is shown as follows:

1. for each test document dx
2.   let Tx = {ti ∈ T | BTW(ti, dx) = 1}
3.   let j = argmax_j P(cj) ∏_{ti ∈ Tx} P(ti | cj), where Tx is the set of terms characterizing dx
4.   assign the class label cj to the document dx
5. next
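A matching sketch of the testing stage, assuming the priors and conditionals produced by the training sketch above; summing log-probabilities is a standard safeguard against numerical underflow that the listing does not spell out:

```python
import math

def nb_classify(term_set, priors, conditionals):
    """Return the class with the highest posterior score. Log-probabilities are
    summed instead of multiplying raw probabilities, to avoid underflow."""
    best_label, best_score = None, float("-inf")
    for label, prior in priors.items():
        score = math.log(prior) + sum(math.log(conditionals[label][t])
                                      for t in term_set if t in conditionals[label])
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```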
3. Architecture of an automatic document classifier in a library setting
An automatic document classification system is designed to assist librarians in categorizing electronic documents. With the explosive growth of the number of electronic documents available on the Internet and in digital libraries, an automatic or semi-automatic document classification system is very desirable to enhance current library practice. The proposed document classification system is underpinned by the supervised machine learning algorithms described in Section 2. Unlike previous research in automatic text categorization, where relatively broad and ad hoc classification schemes were employed [7, 8, 10, 11, 17, 24], we utilize one of the standard document classification schemes, namely the Library of Congress Classification (LCC), as the basis for automatic document classification.

The LCC scheme was created at the end of the nineteenth century for the US Library of Congress and reflected comprehensive categorization knowledge of the printed materials held in the stock of this great library. Although the LCC scheme was designed specifically for the Library of Congress, many large universities and research organizations have adopted it. However, the original version of the LCC scheme contains a large number of class definitions and a complex hierarchical structure. Such a complex categorization scheme, developed for manual document classification, may not be suitable for automatic document classification. In order to develop an operational document classification system which can process a mixture of the traditional printed materials held in most libraries and the electronic documents available on the Internet, a refined LCC scheme is required. One of the objectives of our research is to develop, via empirical testing, a refined LCC scheme for automatic document classification. Our proposal for a refined LCC scheme is contained in the Appendix; it was developed based on the experiments described in Section 5.

The prototype of our web-based automatic document classification system (WADCS) was developed using Java JDK 1.5 and Java Server Pages (JSP 2.0). The Resin web server and the MySQL database management system were used to develop the web application for automatic document classification.
Fig. 1. General architecture of the web-based automatic document classification system (WADCS). (The figure shows the spider, indexer, classifier, evaluator, directory generator and scheme tuner components, together with the classification scheme, index, performance data, classified documents and directory data stores.)
The WADCS system supports five major functions: (1) web document crawling, (2) document indexing, (3) classification, (4) performance evaluation, and (5) directory generation. These functions are performed by the corresponding software components of the WADCS system: the spider, the indexer, the classifier, the evaluator, and the directory generator. The general architecture of the automatic document classification system is shown in Figure 1. The spider automatically retrieves electronic documents from the Internet or certain digital libraries. It then stores a local copy of the document, and assigns a unique file name to each of these electronic documents. The indexer will parse the local archive of the electronic documents and extract the representative terms, using a stop word list and a stemmer. The TFIDF or binary weight of each word stem extracted from a document is then computed. The weighted term vector of each document is then stored in the local index. The classifier analyses the document vectors generated by the indexer, and automatically assigns one or more class labels to a document according to the pre-defined set of categories (e.g. the LCC scheme). The classifier is underpinned by supervised machine learning algorithms, such as the KNN algorithm or the NB algorithm. The evaluator assesses the effectiveness of the classifiers by comparing the class memberships predicted by the system with the correct class memberships of testing documents specified by human experts (e.g. a librarian). The directory generator creates a directory of document pointers (e.g. URLs) based on the classification results, with each pointer referring to the actual
physical location of the electronic document (e.g. a Web page). The categorized document directories provide a convenient way for the library users to access all the information objects pertaining to a specific subject domain; this is in fact one of the mechanisms used to alleviate the problem of information overload. Finally, a classification scheme tuner is used to continuously fine-tune the original LCC scheme, based on the classification performance of the system and the feedback of the librarian. The refined document classification scheme will be used for subsequent automatic document classification.
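As a small illustration of the directory generation step (the data shapes, helper name, and URLs below are hypothetical, not the WADCS internals):

```python
from collections import defaultdict

def build_directory(classification_results):
    """Invert {document_url: [class_labels]} into {class_label: [document_urls]},
    yielding a browsable directory of pointers for each category."""
    directory = defaultdict(list)
    for url, labels in classification_results.items():
        for label in labels:
            directory[label].append(url)
    return dict(directory)

# e.g. two classified web documents (hypothetical URLs and labels)
results = {"http://example.org/calculus-notes": ["QA"],
           "http://example.org/soil-survey": ["S", "QE"]}
print(build_directory(results))
```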
4. The evaluation procedure
Evaluation of the WADCS system in general, and the two supervised machine learning algorithms in particular, was carried out in two different settings: the experimental IR setting and the operational library setting. In each setting, a set of pre-defined categories and a document collection, divided into a training set and a testing set, were applied. The main purpose of employing the experimental IR setting, which is based on the widely used Reuters-21578 document collection, is to evaluate the general performance of the two supervised machine learning algorithms, and to ensure that the performance of our system is comparable to other text classification systems reported in the literature. After evaluating the general performance of our system, we focused on fine-tuning the WADCS system based on a realistic library setting, so that our system can be applied to enhance existing library practice.

The experimental IR setting made use of the well-known Reuters-21578 collection, originally prepared by David Lewis for information retrieval and text categorization research [27]. There are 9603 training documents and 3299 testing documents in the Reuters-21578 collection. However, since 95 training documents were not assigned any class labels, these training documents were excluded from our experiment; as a result, only 9508 training documents were used. There are 90 pre-defined categories for various topics related to economics and finance such as 'acquisition', 'trade', 'Australian dollars', etc. As can be seen, this classification scheme only represents a subset of the standard classification scheme usually employed in the library setting; even if a classifier performs well under such a setting, we cannot conclude that it will also perform well in the real library setting. After our document indexing process, 30,819 features (terms) were extracted from the collection.

For the library setting, the training documents were retrieved from the real MARC records of a large international university, and the spider of our WADCS system was used to collect live web documents from the Internet to form the testing set. It is believed that the mixture of traditional library materials and Web documents can better represent the characteristics of the current document collections held in most modern libraries. Our spider started with initial URLs for some university faculties, home pages of professional societies, or authoritative Internet resource sites for specific subject areas. The web documents retrieved from these sites were downloaded to our local document archive for further review and categorization by a librarian; the 'true class labels' of our testing documents were therefore created manually before our experiment began. There are 505 training documents and 254 testing documents in our library collection. The initial library classification scheme consists of 199 categories extracted from the LCC scheme. The LCC classification scheme was then fine-tuned according to our empirical testing, and a refined LCC scheme which consists of 67 categories was finally developed. It is believed that the refined LCC scheme can produce the optimal automatic document classification performance in realistic library settings. (The performance data of the classifiers based on different classification schemes will be discussed in Section 5.4.) After document indexing, 17,233 features (terms) were extracted from our library collection.
In general, the two classifiers were first trained on the training set of a collection; the testing documents were then fed to the classifiers to obtain the system-predicted class labels. The classification results produced by WADCS were then compared with the correct classifications created by the librarian. The WADCS system first processed the Reuters-21578 collection, followed by our library collection. The objectives of our experiments are: (1) to evaluate the effectiveness of our automatic document classification system under both a common experimental setting and a realistic library setting; (2) to develop a refined LCC scheme which is suitable for automatic
document classification in library settings; (3) to develop appropriate system parameters to allow the supervised machine learning algorithm to be applied to document classification in realistic library settings.

The evaluation metrics of our experiments were adopted from the standard measures developed in the fields of IR and machine learning [7, 14]. The effectiveness of text categorization systems is typically measured in terms of recall, precision, F-measure, and categorical distribution. However, each measure has its merits and limitations [7]. Therefore, we would like to use a combination of these measures, rather than a single measure, to measure the effectiveness of our document classification system. The measures of recall, precision, and F-measure can be explained with respect to a contingency table (Table 1), which characterizes typical outputs from a document classification system.

Table 1. A contingency table for document classification

                     Librarian says 'Yes'     Librarian says 'No'
System says 'Yes'    a (true positive)        b (false positive)
System says 'No'     c (false negative)       d (true negative)
The number of correct decisions made by an automatic document classification system for a particular category is the sum of a (the number of correctly classified documents) and d (the number of non-relevant documents excluded from the particular category). Precision is defined as the proportion of the categorized documents that are correctly categorized, while recall is defined as the proportion of the set of documents belonging to that category that are correctly categorized. Nevertheless, measuring the effectiveness of a document classification system based on recall or precision alone may not be appropriate; the system can trivially achieve the maximum precision by not classifying any documents to a particular category (but will achieve zero recall for that category). Therefore, a better approach is to consider both precision and recall at the same time. The F-measure is a weighted combination of precision and recall values [7]. The relative weight of precision and recall is expressed by the β parameter; if we consider precision as important as recall, β = 1 is set. The micro-averaged F-measure is derived by first determining the total counts of a, b, c, and d for all the pre-defined categories before putting them into the F-measure formula. The advantage of the micro-averaged F-measure is that the final result will not be overly influenced by the effectiveness results from rare categories [7]. The recall, precision, and F-measure metrics are formally defined by:

$$Recall = \frac{a}{a+c} \qquad Precision = \frac{a}{a+b}$$

$$F_\beta = \frac{(1+\beta^2) \times precision \times recall}{\beta^2 \times precision + recall} \qquad F_{\beta=1} = \frac{2a}{2a+b+c}$$
In addition, categorical distribution [6] can measure whether a classifier displays bias towards a particular type of category (e.g. popular categories or rare categories). Categorical distribution is defined by:
$$CD = \frac{|C_{sys}|}{|C|}$$
where C is the set of pre-defined categories, and Csys is the set of categories to which the classifier has assigned documents.
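These four measures are straightforward to compute from per-category contingency counts; the following sketch uses an illustrative data layout and helper names (micro_f1, categorical_distribution), not the evaluator's actual interface:

```python
def micro_f1(counts):
    """counts: per-category (a, b, c) tuples, where a = true positives,
    b = false positives and c = false negatives. Returns micro-averaged
    precision, recall and F(beta=1)."""
    a = sum(x[0] for x in counts)
    b = sum(x[1] for x in counts)
    c = sum(x[2] for x in counts)
    precision = a / (a + b) if a + b else 0.0
    recall = a / (a + c) if a + c else 0.0
    f1 = 2 * a / (2 * a + b + c) if (2 * a + b + c) else 0.0
    return precision, recall, f1

def categorical_distribution(assigned_labels, all_categories):
    """CD = |C_sys| / |C|: the fraction of pre-defined categories actually used."""
    return len(set(assigned_labels)) / len(all_categories)
```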
5. Experiments and results

5.1. Overview
The two supervised machine learning algorithms which underpin the development of our automatic document classification system were first evaluated using the experimental IR setting (Section 5.2), and then in the practical library setting. The effectiveness of the algorithms was measured in terms of precision, recall and micro-averaged F-measure. In addition to the micro-averaged measure, we discuss the categorical distributions of the outputs from the two classifiers in the library setting in Section 5.3. According to Chung and Noh [6], categorical distribution is another important indicator for evaluating the performance of classification algorithms. Fine-tuning of the algorithm parameters for practical application of the KNN algorithm to document classification in the library setting was conducted and is discussed in Section 5.4. Finally, recommendations on applying our machine learning algorithms to enhance library practice are given in Section 5.5.

5.2. General performance in the experimental IR setting
The training set of the Reuters-21578 collection was used to train both the KNN and the NB classifiers of the WADCS system. The test set was then presented to the system with the NB classifier activated. The topic label originally assigned to each Reuters document by human experts was then compared with the system-assigned class label of the same document in order to estimate the effectiveness of the classifier. The same testing procedure was repeated with the KNN classifier activated. The effectiveness of the two classifiers, measured in terms of precision, recall, and micro-averaged F-measure, is highlighted in Table 2.

For the KNN classifier, the micro-averaged F-measure is 0.716. Although this effectiveness figure is slightly lower than previously published figures [8], our system is still considered to produce comparable performance because our experimental procedure and term weighting method are not exactly the same as those adopted in previous experiments. For the NB classifier, the micro-averaged F-measure is 0.685, which is also slightly lower than that reported in previous studies [8], due to the less sophisticated term weighting method employed in our experiment. Our experiment shows that the KNN classifier is more effective than the NB classifier: the micro-averaged F-measure, precision, and recall achieved by the KNN classifier are all higher than those obtained by the NB classifier. This finding is consistent with previous evaluations of the two classifiers [7, 8, 11]. As the performance of our two classifiers is comparable to that of similar systems, it is a positive sign that our automatic document classification system can operate effectively under experimental IR settings.

Table 2. General performance of the two classifiers

       Experimental IR setting              Library setting
       F-measure   Recall   Precision      F-measure   Recall   Precision
KNN    0.716       0.650    0.790          0.802       0.825    0.781
NB     0.685       0.624    0.759          0.546       0.610    0.490
5.3. General performance in the library setting
After confirming that both the KNN and NB classifiers can operate as expected in an experimental IR setting, we further evaluated the effectiveness of these classifiers in the library setting. The measure of categorical distribution was used, as well as the micro-averaged F-measure. Based on our experimental results, the KNN classifier out-performs the NB classifier in terms of both micro-averaged F-measure and categorical distribution.

5.3.1 Micro-averaged F-measure
As shown in Table 2, the KNN classifier consistently out-performs the NB classifier in terms of recall, precision, and micro-averaged F-measure in the library setting. The F-measure of the KNN classifier is 0.802, while the F-measure of the NB classifier is 0.546. As can be seen, the performance of the KNN classifier is significantly better than that of the NB classifier in the library setting, and the performance gap between the two classifiers is larger in the library setting than in the experimental IR setting. Our experimental results show that the KNN classifier is more suitable for automatic document classification in a library setting, whereas either classifier may be considered for document classification in the IR domain.

5.3.2 Categorical distribution
Categorical distribution measures whether a classifier can handle most categories properly, or favours a few categories of a document collection [6]. Given an evenly pre-categorized document collection, a classifier should produce evenly distributed classification results. In other words, we can assess whether a classifier is practical and applicable to a particular domain based on this measure. Table 3 shows the categorical distributions produced by the two classifiers for the library setting.

Table 3. Categorical distribution (254 testing documents; 67 pre-defined classes)

                                                          KNN     NB
Number of classes assigned to the 254 testing documents   434     508
Number of main classes used                                15      13
Number of sub-classes used                                 38      20
Total number of classes used                               53      33
Categorical distribution                                   0.79    0.49
Although the total number of class assignments produced by the KNN classifier (434) is lower than that generated by the NB classifier (508), the actual number of main classes and sub-classes of the LCC scheme utilized by the KNN classifier (53) is higher than for the NB classifier (33). Class assignment refers to the number of classes assigned to a document, whereas main or sub-class used means the number of classes to which the classifier assigns documents. For instance, there are 8.2 documents assigned to a pre-defined class of the LCC scheme by the KNN classifier on average. In practice, a document can be assigned to more than one class (i.e. multi-label text categorization) [7], and our document classification system supports this. For the KNN classifier, all the k nearest neighbours of a testing document may belong to the same class; in this case, a testing document can only be assigned to one class, even if the multi-label categorization feature is enabled. For the NB classifier, it is always possible to assign more than one class to a document. As a result, it is not surprising to see that the NB classifier can generate more class assignments than the KNN classifier. However, in terms of how many classes are being assigned documents by the classifier (i.e. categorical distribution), the KNN classifier out-performs the NB classifier. While the classifiers make use of a similar number of main classes of the LCC scheme, the KNN classifier refers to far more sub-classes than does the NB classifier: the NB classifier only refers to 20 sub-classes, whereas the KNN classifier utilizes 38 sub-classes. Since our library collection is an evenly distributed document collection, the above results show that the KNN classifier is more effective than the NB classifier in the library domain. Figure 2 depicts the categorical distribution produced by the two classifiers.
Fig. 2. Distributions of classified documents from the two classifiers (number of documents per category, KNN vs. naïve Bayes).
As shown in Figure 2, the NB classifier tends to assign most documents to a few categories and leave many categories untouched, while the KNN classifier tends to assign documents to a much wider range of categories. A closer examination of the characteristics of some of these LCC categories in our library collection pinpoints one possible reason for this difference. Table 4 highlights some of the LCC categories, the number of training documents available for these categories, and the use of these categories made by the two classifiers. It is obvious that while the NB classifier can classify many documents in the popular categories (categories with many training instances), it fails to assign any documents to the rare categories. The KNN classifier, however, performs well for both popular and rare categories.

Table 4. Distribution in popular and rare categories

Classes or     Number of training     Number of documents assigned by
sub-classes    documents available    KNN      NB
TA             39                     74       78
HC             43                     26       76
L              29                     12       64
R              23                     36       45
QA             25                     16       32
GC             1                      4        0
QE             1                      5        0
QK             1                      6        0
RD             1                      8        0
S              1                      4        0
5.4. Parameter adjustment in the library setting
In order to enable our automatic document classification system to achieve the best performance in the library setting, we further fine-tuned the KNN classifier by adjusting some of the system parameters. For a KNN classifier, we can generally tune three parameters, namely the number of categories assigned to each document, the k value, and the similarity threshold. Although assigning more class labels to each document of the testing set may lead to a higher recall, the precision result usually becomes poorer at the same time. Therefore, the number of classes to be assigned to each document needs to be estimated correctly before we can effectively apply the KNN classifier to
document classification at a library. The k parameter of the KNN algorithm specifies how many similar training documents are to be considered when the system tries to predict the class label of an incoming document. While a low value of k may limit the system's capability for generalization, a very large value of k may introduce classification errors and lower the precision [7]. Instead of employing a fixed k value, a similarity threshold may be used to establish the set of nearest neighbours. A similarity threshold specifies a minimum value which the similarity between a training document and a testing document must exceed in order for the training document to be considered as one of the nearest neighbours of the testing document.

Classification experiments using various parameters for the KNN classifier were conducted for the library setting. These parameters were initially estimated based on a subset of the training set (the validation set), and then evaluated using the testing set. Table 5 summarizes our experimental results; the row marked with an asterisk indicates the best combination of KNN parameters in the library setting. From our initial experiments, employing k = 3 and assigning at most two classes to each document leads to the best classification performance. Employing a low similarity threshold (e.g. 0.05) tends to generate many noisy nearest neighbours, and as a result the precision of the KNN classification is adversely affected in our experiments.

Table 5. Experimental results for eight combinations of KNN parameters

Classes assigned   k value   Similarity threshold   F-measure
2                  2         —                      0.798
2                  —         0.05                   0.793
2                  3         —                      0.802 *
2                  —         0.05                   0.797
3                  3         —                      0.772
3                  —         0.05                   0.775
3                  4         —                      0.767
3                  —         0.05                   0.769
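This kind of tuning can be reproduced with a simple grid search over a validation set; the sketch below reuses the hypothetical knn_classify and micro_f1 helpers introduced earlier and is not the WADCS tuning code:

```python
from itertools import product

def tune_knn(training_set, validation_set, ks=(2, 3, 4), labels_per_doc=(1, 2, 3)):
    """Grid search over k and the number of assigned labels; returns the
    combination with the highest micro-averaged F1 on the validation set."""
    best_params, best_f1 = None, -1.0
    for k, n in product(ks, labels_per_doc):
        counts = []
        for vec, true_labels in validation_set:
            predicted = set(knn_classify(vec, training_set, k=k, n_labels=n))
            truth = set(true_labels)
            a = len(predicted & truth)  # true positives for this document
            counts.append((a, len(predicted - truth), len(truth - predicted)))
        _, _, f1 = micro_f1(counts)
        if f1 > best_f1:
            best_params, best_f1 = (k, n), f1
    return best_params, best_f1
```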
5.5. A refined LCC categorization scheme
The initial LCC scheme adopted in our experiment involves 199 categories. However, we believe that some fine-tuning of this manual classification scheme is required before we can apply it to automatic document classification in practice. For example, human experts generally have no difficulty in distinguishing documents belonging to the category 'history of education' (LA) from the category 'theory and practice of education' (LB). However, this may be too challenging a task for most state-of-the-art classifiers. Therefore, in addition to fine-tuning the parameters of the supervised machine learning algorithm, we also need to refine the LCC scheme for practical use in automatic document classification in a library setting. For instance, merging category LA with category LB to form the broader category L seems reasonable. We identified this kind of refinement requirement by running the WADCS system on the library collection using a variety of combinations of categories from the LCC scheme and comparing the resulting micro-averaged F-measures. After our fine-tuning, a refined LCC scheme which consists of 67 categories was developed; a sample of the refined scheme is attached as the Appendix. The number of pre-defined categories was reduced in the refined LCC scheme. Based on our empirical evaluation, we assert that the simplified LCC scheme is more suitable for machine learning based automatic document classification, whereas the original LCC scheme is too complicated and only suitable for manual document classification.

Table 6 shows the classification results (micro-averaged F-measures) for the KNN classifier and the NB classifier, based on the same sets of training and testing documents (our library collection). We have empirically validated many possible combinations of categories, and Table 6 only shows the performance of the two classifiers on the original LCC scheme and the best refined LCC scheme.

Table 6. Classification performance based on the refined LCC scheme

No. of pre-defined categories   KNN     NB
199                             0.736   0.420
67                              0.802   0.546

The results show that the KNN classifier out-performs the NB classifier whether the classification tasks are conducted using the original LCC scheme or the refined LCC scheme. Once again, this demonstrates that the KNN classifier is more sophisticated and effective than the NB classifier [7],
even when they are applied to the seldom explored library setting. In fact, the KNN classifier is more adaptive than the NB classifier, because it can cope with different library settings (e.g. different classification standards employed in libraries). The best system performance is achieved if both the KNN classifier and the refined LCC scheme are employed in the WADCS system. We make the following overall recommendations for applying a supervised machine learning algorithm to automatic document classification in library settings:

Classification method: KNN
Classification scheme: a refined LCC scheme (see Appendix); when the refined LCC scheme is applied to different domains, the number of sub-categories may be increased if the subject domain is specific rather than general
No. of classes assigned: 2
k value: 3
Similarity threshold: not used
6. Conclusions and future research directions
Document categorization is one of the main tasks carried out by many librarians. Nevertheless, with the explosive growth of the number of educational resources available on the Internet, it is extremely difficult, if not totally impossible, for library practitioners to categorize both electronic documents and printed materials solely using the traditional manual approach. To alleviate the problem of manual categorization in the library setting, more serious research into automatic document classification methods is desirable. Recent success in machine learning research sheds light on the possible application of advanced technology to improving current library practice.

We illustrate our automatic document classification system, WADCS, which is underpinned by two supervised machine learning algorithms. The design and development of WADCS opens the door to the application of advanced information technology to enhance current library practice. Beyond the design and development of the document classification system, we further evaluate the KNN and NB classifiers in a real library setting, to bridge the gap between experimental and operational contexts. Our empirical tests show that the KNN classifier is more effective than the NB classifier in terms of both recall and precision; the KNN classifier also produces a much more evenly distributed classification than does the NB classifier. According to our initial experiments, better performance of the KNN classifier can be achieved by tuning parameters such as the maximum number of class labels assigned to each document, the k value, and the similarity threshold. As a whole, our experimental results indicate that supervised machine learning algorithms in general, and the k-nearest neighbours algorithm in particular, can be adopted to develop an effective document classification system for the library setting.

Apart from generating an archive of classified documents, our automatic document classification system can also be applied to facilitate collection development in academic libraries. The system can provide an indicator to show the comprehensive subject coverage of the library collection as a whole, regardless of the formats of the materials. A refined LCC scheme, which is more suitable for automatic document classification, has also been developed. To the best of our knowledge, the research work reported in this paper represents one of only a few attempts to apply machine learning techniques to improve library practice.

Although our research work involves the evaluation of two machine learning algorithms from the point of view of applied library practice, an overall evaluation framework for automatic document
classification systems operating in the library setting has not been developed. Previous evaluations of the KNN and NB algorithms in experimental IR or machine learning settings demonstrated slightly better performance than that achieved by our implementation in the library setting. One of the reasons may be that our system uses a less sophisticated term weighting and feature selection method than some others. Follow-up investigations into more sophisticated feature selection methods such as latent semantic indexing may boost the performance of our current system. A larger library collection and other standard document classification schemes will also be used to further evaluate the performance of our system. Furthermore, other supervised or un-supervised machine learning algorithms can be explored to develop more effective classifiers for automatic document classification. Another direction of future research may focus on automatic document classification for a particular subject area, e.g. Class Q (Science). The LCC classes can be broken down into deeper levels, so that more specific areas of the subject can be covered. This can provide a useful tool for librarians to develop a specialized subject directory which contains pointers to all the resources related to a particular subject area.
References

[1] K. Coyle, The library catalog in a 2.0 world, The Journal of Academic Librarianship 33(2) (2007) 289–291.
[2] M. Paiste, Defining and achieving quality in cataloging in academic libraries: a literature review, Library Collections, Acquisitions, and Technical Services 27(3) (2003) 327–338.
[3] D. Levy, To grow in wisdom: Vannevar Bush, information overload, and the life of leisure. In: Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (2005) 281–286.
[4] G. Porter and L. Bayard, Including websites in the online catalog: implications for cataloging, collection development, and access, The Journal of Academic Librarianship 25(5) (1999) 390–394.
[5] P.G. Chander, R. Shinghal, B.C. Desai and T. Radhakrishnan, An expert system to aid cataloging and searching electronic documents on digital libraries, Expert Systems with Applications 12(4) (1997) 405–416.
[6] Y.M. Chung and Y.-H. Noh, Developing a specialized directory system by automatically classifying web documents, Journal of Information Science 29(2) (2003) 117–126.
[7] F. Sebastiani, Machine learning in automated text categorization, ACM Computing Surveys 34(1) (2002) 1–47.
[8] T. Joachims, Text categorization with support vector machines: learning with many relevant features. In: Proceedings of ECML-98, 10th European Conference on Machine Learning (Chemnitz, Germany, 1998) 137–142.
[9] D. Koller and M. Sahami, Hierarchically classifying documents using very few words. In: Proceedings of the 14th International Conference on Machine Learning (ICML-97) (1997) 170–178.
[10] O.-W. Kwon and J.-H. Lee, Text categorization based on k-nearest neighbor approach for web site classification, Information Processing and Management 39(1) (2003) 25–44.
[11] Y. Yang and X. Liu, A re-examination of text categorization methods. In: Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in Information Retrieval (Berkeley, CA, 1999) 42–49.
[12] H. Avancini, A. Rauber and F. Sebastiani, Organizing digital libraries by automated text categorization (Technical Report 2002-TR-05, Istituto di Elaborazione dell'Informazione, Pisa, 2002).
[13] H.P. Luhn, The automatic creation of literature abstracts, IBM Journal of Research and Development 2(2) (1958) 159–165.
[14] G. Salton and M. McGill, Introduction to Modern Information Retrieval (McGraw-Hill, New York, 1983).
[15] L.D. Baker and A.K. McCallum, Distributional clustering of words for text classification. In: Proceedings of the 21st International Conference on Research and Development in Information Retrieval (SIGIR '98) (Melbourne, Australia, 1998) 96–103.
[16] A.S. Weigend, E.D. Wiener and J.O. Pedersen, Exploiting hierarchy in text categorization, Information Retrieval 1(3) (1999) 193–216.
[17] S.T. Dumais and H. Chen, Hierarchical classification of web content. In: Proceedings of SIGIR-00, 23rd ACM International Conference on Research and Development in Information Retrieval (Athens, Greece, 2000) 256–263.
[18] D.D. Lewis and F. Sebastiani, Report on the workshop on operational text classification systems (OTC-01) (2001). Available at: http://www.sigir.org/forum/F2001/textClassification.pdf (accessed 28 June 2007).
[19] S.T. Dumais, D.D. Lewis and F. Sebastiani, Report on the workshop on operational text classification systems (OTC-02) (2002). Available at: http://www.sigir.org/forum/F2002/sebastiani.pdf (accessed 28 June 2007).
[20] L.M. Chan, Exploiting LCSH, LCC and DDC to retrieve networked resources: issues and challenges (2000). Available at: http://lcweb.loc.gov/catdir/bibcontrol/chan_paper.html (accessed 28 June 2007).
[21] D. Vizine-Goetz, Exploiting LCSH, LCC, and DDC to retrieve network resources (2001). Available at: http://www.loc.gov/catdir/bibcontrol/vizinegoetz_paper.html (accessed 28 June 2007).
[22] J. Godby and J. Stuler, The Library of Congress Classification as a knowledge base for automatic subject categorization. In: IFLA Preconference 'Subject Retrieval in a Networked Environment' (Dublin, OH, 2001).
[23] G. Möller, K.-U. Carstensen, B. Diekmann and H. Wätjen, Automatic classification of the World-Wide Web using the Universal Decimal Classification. In: B. McKenna (ed.), Proceedings of the 23rd International Online Information Meeting (London, 1999) 231–237.
[24] D.D. Lewis, Naive (Bayes) at forty: the independence assumption in information retrieval. In: Proceedings of ECML-98, 10th European Conference on Machine Learning (Chemnitz, Germany, 1998) 4–15.
[25] M. Porter, An algorithm for suffix stripping, Program 14(3) (1980) 130–137.
[26] T.M. Mitchell, Machine Learning (WCB/McGraw-Hill, Boston, MA, 1997).
[27] D.D. Lewis, Reuters-21578 text categorization test collection (1997). Available at: http://www.daviddlewis.com/resources/testcollections/reuters21578/ (accessed 28 June 2007).
Appendix. A refined LCC scheme for automatic document classification

Class B    Philosophy, Psychology, Religion
  Subclass BF    Psychology
  Subclass BL    Religion

Class C    History

Class G    Geography, Anthropology, Recreation
  Subclass GC    Oceanography
  Subclass GE    Environmental Sciences
  Subclass GR    Folklore
  Subclass GV    Manners, Recreation, Leisure

Class H    Social Sciences
  Subclass HC    Economics, Industries, Finance
  Subclass HM    Sociology, Socialism

Class J    Political Science
  Subclass JZ    International Relations

Class K    Law

Class L    Education

Class M    Music

Class N    Fine Arts
  Subclass NA    Architecture
  Subclass NB    Sculpture
  Subclass NC    Drawing, Design
  Subclass ND    Painting

Class P    Language and Literature
  Subclass PE    English Language and Literature
  Subclass PL    Oriental Language and Literature

Class Q    Science
  Subclass QA    Mathematics
  Subclass QB    Astronomy
  Subclass QC    Physics
  Subclass QD    Chemistry
  Subclass QE    Geology
  Subclass QH    Biology
  Subclass QK    Botany
  Subclass QL    Zoology
  Subclass QM    Human Anatomy
  Subclass QP    Physiology
  Subclass QR    Microbiology

Class R    Medicine
  Subclass RB    Pathology
  Subclass RD    Surgery
  Subclass RE    Ophthalmology
  Subclass RF    Otorhinolaryngology
  Subclass RG    Gynaecology and Obstetrics
  Subclass RJ    Paediatrics
  Subclass RK    Dentistry
  Subclass RL    Dermatology
  Subclass RM    Therapeutics, Pharmacology
  Subclass RS    Pharmacy and Materia Medica
  Subclass RT    Nursing
  Subclass RX    Homeopathy

Class S    Agriculture
  Subclass SB    Plant Culture
  Subclass SD    Forestry
  Subclass SF    Animal Culture
  Subclass SH    Aquaculture, Fisheries, Angling
  Subclass SK    Hunting Sports

Class T    Technology
  Subclass TA    Engineering
  Subclass TH    Building Construction
  Subclass TL    Motor Vehicles, Aeronautics, Astronautics
  Subclass TR    Photography
  Subclass TS    Manufactures
  Subclass TT    Handicrafts, Arts and Crafts
  Subclass TX    Home Economics

Class U    Military Science

Class V    Naval Science

Class Z    Bibliography, Library and Information Science
Note: this set of categories is a refined version of the LCC scheme. Some categories in the original LCC scheme were grouped together under other similar categories, or categories carrying broader meaning.