Multidimensional Text Classification for Drug Information - IEEE Xplore


IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE, VOL. 8, NO. 3, SEPTEMBER 2004

Multidimensional Text Classification for Drug Information

Verayuth Lertnattee and Thanaruk Theeramunkong

Abstract—This paper proposes a multidimensional model for classifying drug information text documents. The concept of multidimensional category model is introduced for representing classes. In contrast with traditional flat and hierarchical category models, the multidimensional category model classifies each document using multiple predefined sets of categories, where each set corresponds to a dimension. Since a multidimensional model can be converted to flat and hierarchical models, three classification approaches are possible, i.e., classifying directly based on the multidimensional model and classifying with the equivalent flat or hierarchical models. The efficiency of these three approaches is investigated using drug information collection with two different dimensions: 1) drug topics and 2) primary therapeutic classes. In the experiments, $k$-nearest neighbor, naïve Bayes, and two centroid-based methods are selected as classifiers. The comparisons among three approaches of classification are done using two-way analysis of variance, followed by the Scheffé’s test for post hoc comparison. The experimental results show that multidimensional-based classification performs better than the others, especially in the presence of a relatively small training set. As one application, a category-based search engine using the multidimensional category concept was developed to help users retrieve drug information.

Index Terms—Drug information, machine learning, natural language processing, text classification.

I. INTRODUCTION

THE FAST growth and dynamic change of online information have provided us with a very large amount of information and led to the condition known as information overload. This phenomenon is also present in the area of drug information. Text categorization (TC) is an important tool for organizing documents into classes by applying statistical methods or artificial-intelligence techniques. With TC, the documents can be expected to be used more effectively. As a result, the situation of information overload may be alleviated. TC can be applied to several types of documents, e.g., news [1], hypertext markup language [2], extensible markup language [3], and standard generalized markup language [4], [5] documents. So far, a variety of learning techniques for TC have been developed, including nearest neighbor classification [2], [5], Bayesian approaches [1], [4], [6], decision trees [4], linear classifiers [7], [8],

Manuscript received August 31, 2003; revised April 12, 2004. This work was supported by the National Electronics and Computer Technology Center (NECTEC) under Research Grant NT-B-22-I5-38-47-04. The authors are with the Sirindhorn International Institute of Technology, Bangkadi Campus, Thammasat University, Bangkadi, Pathumthani 12000, Thailand (e-mail: [email protected]; [email protected]; thanaruk@siit.tu.ac.th). Digital Object Identifier 10.1109/TITB.2004.832542

neural networks [9], and support vector machines [10], [11]. Recently, some researchers [12], [13] have applied TC techniques to medical documents (e.g., MEDLINE). There are quite a few works that contribute to categorizing drug-related documents. One of the interesting characteristics of drug-related documents is that a document can be organized in different ways. In the pharmaceutical field, a drug monograph is a kind of drug document describing a drug in text form. We can organize a collection of drug monographs in at least two ways, i.e., by topics and by therapeutic classes. These multiple views of organization are the main issue in this work.

Most previous work on TC has focused on classifying text documents into a set of flat categories [4], [5], [14]. The task is to classify documents into a predefined set of categories (or classes) where there are no structural relationships among these categories. It is difficult to browse or search documents in flat categories when there is a large number of categories. One natural extension to flat categories is to arrange documents in a hierarchy of topics instead of a simple flat structure. When people organize extensive data sets into fine-grained classes, a topic hierarchy is often employed to make a large collection of categories more manageable. This structure is known as a category hierarchy. Many popular search engines and text databases apply this structure, such as Yahoo, Google Directory, and MEDLINE. Many recent works attempt to automate text classification based on this category hierarchy [3], [11], [15], [16]. However, with a large number of classes or a large hierarchy, the training data for a class at the lower levels of the hierarchy become sparse. This decreases classification accuracy, especially for the lower classes.
As another problem, the traditional category hierarchy may be too rigid for us to construct since there exist several possible category hierarchies for a data set. To cope with these problems, this paper proposes a new approach, called the multidimensional approach, for TC. Using the concept of the multidimensional category model, one can expect an improvement in classification regardless of classifiers and training set size. We present the results of applying this approach to drug-related web documents and compare it with two traditional ones, the flat and hierarchical approaches. The effect of the size of a training set on each classification approach is also investigated and interpreted using statistical methods. Finally, an automatic classification system using the multidimensional category concept is applied in a search engine to help users retrieve drug information.

This paper is organized as follows. In Section II, drug information is introduced. Section III presents the multidimensional category model. Three approaches of classification are described in Section IV. Section V presents classification algorithms. Section VI reports experimental results. The conclusion is drawn in Section VII.

1089-7771/04$20.00 © 2004 IEEE

LERTNATTEE AND THEERAMUNKONG: MDTC FOR DRUG INFORMATION

II. DRUG INFORMATION

Thirty years ago, drugs were few in number and normally of low potency. Enquiries on drug therapy were traditionally referred to pharmacists, who could quickly answer the majority by reference to pharmacopoeias and formularies. More recently, several factors have combined to alter this pattern. The increasing number of drugs and their greater potency are the reasons for the so-called therapeutic explosion. Therefore, it is hard for pharmacists to remember or understand all of them. Moreover, the more potent medicines have generally caused a higher incidence of iatrogenic disease, and the literature relating to drugs has expanded at a colossal rate.

As a common type of drug information, drug monographs are widely used to convey details of drugs to health care teams and patients. The monographs are often arranged by therapeutic classes or generic names of the drugs. The topics in monographs generally include brand names, chemical name, generic names, descriptions, clinical pharmacology, indications, dosage, administration, interaction, contraindications, adverse effects, overdosage, and so on. The main sources of drug monographs are the standard references, e.g., American Hospital Formulary Service Drug Information (AHFS DRUG), Facts and Comparisons, Physician’s Desk Reference, and Mosby’s Drug Consult. Most pharmacists consider that the information provided by drug manufacturers (e.g., instruction and dosage regimen on label and package insert) is useful. However, such information is quite limited. Some information is provided by sources other than the manufacturers; their monographs are considered to be less biased and are often much more clearly written. To solve the problems which health care teams face with printed drug information, an online strategy was developed.
Today, several popular web sites, including the Food and Drug Administration (www.fda.gov), Pharminfo (www.pharminfo.com), and RxList (www.rxlist.com) [17], provide the latest information for professionals and patients. The volume and variety of new drug information are daunting. Diverse Internet sources of information help to increase the speed at which new information becomes available. However, no single source of new information stands out as the best, the most comprehensive, or the most timely. Many sources must be consulted to keep current [18]. A health-care team and its patients can take advantage of online drug information by searching in the dimensions of brand names, generic names, and therapeutic categories. For the therapeutic category, there is a set of commonly used standards. According to the three-byte Hierarchical Ingredient Code (HIC3), drugs are classified into roughly 700 different classes, while approximately 250 therapeutic categories are defined in AHFS [19]. To assist users and allow greater success in searches of therapeutic categories, a cross-index of all drugs, including their AHFS and HIC3 therapeutic categories, has been constructed. Drug information providers should classify their web pages into categories that help users retrieve data. The effective ways to classify these web pages should be studied in the direction


of not only classification algorithms but also category models used for classification.

III. MULTIDIMENSIONAL CATEGORY MODEL

Categories are a powerful tool for managing a large number of text documents. By grouping text documents into a set of categories, it is possible for us to efficiently keep or search for the information we need. At this point, the structure of categories, called the category model, becomes one of the most important factors that determine the efficiency of organizing text documents. In the past, two traditional category models, called flat and hierarchical category models, were applied. However, these models have a number of disadvantages. In the flat category model, browsing or searching the categories becomes difficult as the number of categories grows. In the hierarchical category model, constructing a good hierarchy is a complicated task, and there are several possible hierarchies for a document set. Since a hierarchy is basically static, browsing and searching documents along the hierarchy is always done in a fixed order, from the root to a leaf node. Therefore, searching flexibility is lost.

As an alternative to flat and hierarchical category models, the multidimensional category model is introduced in this work. The proposed model is an extension of a flat category model, where documents are not classified into a single set of categories; instead they are classified into multiple sets. Each set of categories can be viewed as a dimension in the sense that documents may be classified into different kinds of categories. The multidimensional category model has the following merits compared to the other two category models. First, it is more natural than a flat category model in the sense that a document can be classified based not on a single criterion but on multiple criteria.
Second, in contrast with a hierarchical category model, it is possible for us to browse or search documents flexibly without an order constraint defined in the structure. Lastly, the multidimensional category model can be directly transformed to and represented by the flat or hierarchical category models, although the converse transformations are not always intuitive.

This section describes a way to apply the multidimensional category model to multidimensional data, e.g., a drug monograph. In Fig. 1, a drug monograph is composed of three topics in the first dimension [i.e., Pharmacology (P), Indications (I), Warnings (W)] and two therapeutic classes in the second dimension [i.e., Chemotherapy (1) and Central Nervous System (2)]. For example, a quinine monograph is composed of three parts, P1, I1, and W1, with each part representing one topic in the monograph. P1 means pharmacology of a chemotherapy drug, i.e., quinine. The figure shows the detail of P1 and the representations for classifying P1 based on four category models: flat category model (F), hierarchical category models which begin with the topic dimension (H1) or the therapeutic class dimension (H2), and multidimensional model (M).

Fig. 1. Four category models (F, H1, H2, and M) for classifying a drug monograph based on topics and therapeutic classes.

IV. MULTIDIMENSIONAL TEXT CLASSIFICATION APPROACHES

TC is a task of assigning a Boolean value (true or false) to each pair $(d_j, c_i) \in D \times C$, where $D$ is a set of documents in a collection and $C$ is a set of predefined categories. The value true is assigned to $(d_j, c_i)$ when the document $d_j$ is determined to belong to the category $c_i$, and false otherwise. In the training phase, the task is to approximate the unknown target function $\Phi: D \times C \to \{\mathrm{true}, \mathrm{false}\}$ that describes how documents should be classified.

Although the task of multidimensional text classification (MDTC) is similar to that of conventional TC, there is an additional issue we need to consider. The issue concerns how the classification tasks for the several dimensions are performed. The classification can be made directly on the multidimensional category model or on its equivalent flat or hierarchical models. From another viewpoint, the classification decisions for the dimensions can be made all at once (flat model), dimension by dimension in a fixed order (hierarchical models), or dimension by dimension without any fixed order (multidimensional model). This section gives an analysis and formulation of these three approaches.

A. Flat-Based Approach

The simple method for classifying documents according to a multidimensional model is to transform the multidimensional category model to an equivalent flat category model and then apply conventional classification methods directly to the derived flat categories. The flat category model can be derived by enumerating all combinations of classes among dimensions. The training phase of text classification based on this approach is to find an optimized function $\Phi': D \times C \to \{\mathrm{true}, \mathrm{false}\}$ imitating an MDTC function $\Phi$, where $C$ is the set of all combinations of classes among the $n$ dimensions. That is, $C = C_1 \times C_2 \times \cdots \times C_n$, where each $c = (c^1, c^2, \ldots, c^n) \in C$ is a combined $n$-dimensional category of $C_1, C_2, \ldots, C_n$. The classification is to find the best $n$-dimensional class to which the document being classified belongs. In the flat-based approach, the $n$-dimensional model for each combined class $c$ is created by using a set of training data that have already been assigned the labels for the first, second, $\ldots$, $n$th dimensions.

The granularity (difference of terms for categories) of the derived flat categories is finer than that of the original multidimensional categories since all combinations of classes in the dimensions are enumerated. This implies that a flat category represents the class more precisely than a multidimensional category, so one can expect high classification accuracy. On the other hand, the number of training data (documents) per class is reduced. As a consequence, flat-based classification may face the sparseness problem of training data. This may make it harder to classify documents and thus reduce the classification accuracy. In terms of computational cost, a test document has to be compared to all enumerated classes, i.e., $\prod_{i=1}^{n} |C_i|$ classes, resulting in high computation.

B. Hierarchical-Based Approach

As the second approach, hierarchical-based classification is introduced to classify a text downward from the root to a leaf in the hierarchy. A multidimensional category model can be transformed into a hierarchical category model, and then we can apply standard hierarchical classification on the derived hierarchical model. However, several hierarchical models can be generated from a multidimensional model, depending on the order of dimensions, as described in Section III. The classification proceeds along the hierarchy from the root to a leaf. Given a document, the decision of its class for each dimension is made step by step. First, the class of the document for the dimension located at the level immediately under the root is determined. Next, based on the resulting class, the class for the second level is computed. This action is repeated until the leaf level is reached. Due to this sequential property, the classifier $\Phi'$ can be transformed into subfunctions: $\Phi'(d, c) = \psi(\Phi'_1(d, c^1), \Phi'_2(d, c^2), \ldots, \Phi'_n(d, c^n))$, where $\Phi'_i$ is a classifier for the $i$th dimension, $\psi$ is a function to join the results from these classifiers, and $c = (c^1, c^2, \ldots, c^n)$ is a combined $n$-dimensional category among $C_1 \times C_2 \times \cdots \times C_n$. This function has a property that makes the classification decision of the $i$th dimension depend on the classification decisions of the preceding $i-1$ dimensions. The classification is, for a document, to find the best $n$-dimensional class by considering the dimensions step by step from the first to the last, in order.

In the training phase, the models of $\Phi'_1, \ldots, \Phi'_n$ are calculated from the training set. A class model (or node) at a level closer to the root has a coarser granularity. Such class models represent classes less precisely, but there are more training data (documents) for them. On the other hand, class models (or nodes) near the leaves have finer granularity and thus a more precise representation, but less training data. The classification process and the acquired accuracy will vary with the order of dimensions in the hierarchy. In the classification phase, the number of classes compared with a test document is $\sum_{i=1}^{n} |C_i|$.
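The class counts behind the computational-cost argument can be checked with a short sketch. It assumes the two dimensions of the drug information collection used later in the paper (seven topics and five primary therapeutic classes); the function names are hypothetical and only illustrate the counting, not the authors' implementation.

```python
import math
from itertools import product

# Two dimensions taken from the paper's drug information collection.
TOPICS = ["adverse drug reaction", "clinical pharmacology", "description",
          "indications", "overdose", "patient information", "warnings"]
THERAPEUTIC_CLASSES = ["chemotherapy", "neuro-muscular system",
                       "cardiovascular and hematopoietic", "hormone",
                       "respiratory system"]


def flat_classes(dimensions):
    """Enumerate every combination of classes across the dimensions."""
    return list(product(*dimensions))


def comparisons_flat(dimensions):
    """A flat classifier compares a test document with every combined class."""
    return math.prod(len(dim) for dim in dimensions)


def comparisons_per_dimension(dimensions):
    """Deciding one dimension at a time compares with |C_1| + ... + |C_n| classes."""
    return sum(len(dim) for dim in dimensions)


DIMENSIONS = [TOPICS, THERAPEUTIC_CLASSES]
print(len(flat_classes(DIMENSIONS)))          # number of derived flat categories
print(comparisons_flat(DIMENSIONS))           # product of the dimension sizes
print(comparisons_per_dimension(DIMENSIONS))  # sum of the dimension sizes
```

With two small dimensions the gap between the product and the sum is modest, but the product grows multiplicatively with every added dimension while the sum grows only additively.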

C. Multidimensional-Based Approach

It is possible to directly classify a document using the multidimensional category model. The class of the document for each dimension is determined independently. We call this the multidimensional-based approach. In this approach, the classifier $\Phi'$ can be decomposed into a set of subfunctions: $\Phi'(d, c) = \psi(\Phi'_1(d, c^1), \Phi'_2(d, c^2), \ldots, \Phi'_n(d, c^n))$, where $\Phi'_i$ is a classifier for the $i$th dimension and $\psi$ is a function to join the results from the classifiers. There are $n$ classifiers for $n$ dimensions. Furthermore, each of them is constructed independently, without any effect from the other dimensions. In this approach, the granularity of categories is coarser than in both the flat-based and hierarchical-based approaches. For each dimension, a document is classified based on the categories in that dimension instead of into the set of finer flat categories generated by computing all combinations of categories among dimensions, as shown in Section III. Despite the disadvantage of not precisely representing any finer categories, the number of training documents per class in this approach is relatively large. As a consequence, the multidimensional approach is likely to gain high classification accuracy in each dimension, which results in high overall classification accuracy even when there is a small number of training data. It also performs faster than flat-based classification since there are fewer classes to be compared, i.e., $\sum_{i=1}^{n} |C_i|$, which is equal to the number in the hierarchical-based approach.

V. CLASSIFICATION ALGORITHMS

To investigate the proposed MDTC, three popular classification algorithms for training classifiers are used.

A. $k$-Nearest Neighbor Algorithm

As an instance-based algorithm, the $k$-nearest neighbor classifier ($k$-NN) finds the $k$ training documents most similar to the test document being classified. The similarity of the test document to a class is computed by summing the similarities of those documents among the $k$ documents that belong to that class. The test document is assigned the class that has the highest similarity to the document. The three parameters involved are the representation of documents, the similarity function, and the number $k$. In this work, we define term weighting in both training documents and testing documents using $(0.5 + 0.5\,\mathrm{tf}/\mathrm{tf}_{\max}) \times \mathrm{idf}$, where tf, $\mathrm{tf}_{\max}$, and idf are the term frequency, the maximum term frequency of the focused document, and the inverse document frequency, respectively. This formula was used in several works, such as [5] and [20]. It is a form of weak normalization. The similarity function is a simple dot product. The parameter $k$ can be determined by experiments. It is usually set to a number between 20 and 50 in normal applications. In this work, $k$ is set to 20.

B. Naïve Bayes Algorithm

As a statistical-based algorithm, the naïve Bayes (NB) classifier first calculates the posterior probability $P(c \mid d)$ of each class $c$ to which the document $d$ may belong, and assigns the document to the class with the highest posterior probability. Basically, a document can be represented by a bag of the words in that document (i.e., a vector of occurrence frequencies of words in the document). NB assumes that the effect of a word’s occurrence on a given class is independent of other words’ occurrence. With this assumption, an NB classifier finds the most probable class $c_{\mathrm{MAP}}$, called the maximum a posteriori (MAP) class, for the document, which is determined by $c_{\mathrm{MAP}} = \arg\max_{c} P(c \mid d) = \arg\max_{c} P(c) \prod_{i} P(w_i \mid c)$, where the $w_i$ are the words of the document. In a preliminary experiment, using occurrence frequency to calculate the posterior probability $P(c \mid d)$ outperformed using binary frequency. Therefore, this method is used in this work.

C. Centroid-Based Algorithm

The centroid-based (CB) algorithm is a linear classification algorithm. Only positive documents are taken into account for constructing the centroid vector of a class. The vector is normalized by the document length to a unit-length vector (or prototype vector). In the classification stage, a test document is compared with these prototype vectors by dot product (cosine measure) in order to find the nearest class. Normally, a CB classifier obtains high classification accuracy with small time complexity. In our previous work [20], we proposed techniques to improve the CB classifier by introducing term-distribution factors to term weighting, in addition to the standard $\mathrm{tf} \times \mathrm{idf}$. In this work, we use either of two term-weighting formulas (not legible in this copy), later called TCB1 and TCB2, respectively. These formulas use the average class term frequency and the root mean square of the number of occurrences of a term in a class, together with the inverse document frequency and the standard deviation of the term. These two term weights worked well in our preliminary experiments.
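As a rough illustration of the multidimensional-based approach combined with a centroid-based classifier, the sketch below trains one centroid classifier per dimension and joins the per-dimension decisions into a tuple. It is a minimal sketch under stated assumptions: it uses plain tf × idf weighting rather than the paper's TCB1 or TCB2 weights, whitespace tokenization, and a hypothetical four-document toy corpus (the diazepam documents and all function names are invented for illustration).

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def idf_table(docs):
    """Inverse document frequency log(N / df) for each training term."""
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log(len(docs) / count) for term, count in df.items()}


def unit_vector(weights):
    """Scale a sparse vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)
    return {t: w / norm for t, w in weights.items()}


def tfidf(text, idf):
    """Unit-length tf x idf vector; terms unseen in training are ignored."""
    tf = Counter(tokenize(text))
    return unit_vector({t: f * idf[t] for t, f in tf.items() if t in idf})


def train_centroids(docs, labels, idf):
    """Sum the vectors of each class's own documents, then normalize."""
    sums = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        for term, weight in tfidf(doc, idf).items():
            sums[label][term] += weight
    return {label: unit_vector(vec) for label, vec in sums.items()}


def classify(text, centroids, idf):
    """Return the class whose prototype has the largest dot product."""
    vec = tfidf(text, idf)
    return max(centroids, key=lambda label: sum(
        w * centroids[label].get(t, 0.0) for t, w in vec.items()))


def classify_multidimensional(text, models, idf):
    """One independent decision per dimension, joined into a tuple."""
    return tuple(classify(text, centroids, idf) for centroids in models)


# Hypothetical toy corpus: each document has a topic and a therapeutic class.
DOCS = ["quinine kills malaria parasites mechanism",
        "quinine is indicated for malaria treatment",
        "diazepam enhances gaba receptor mechanism",
        "diazepam is indicated for anxiety treatment"]
TOPIC_LABELS = ["pharmacology", "indications", "pharmacology", "indications"]
CLASS_LABELS = ["chemotherapy", "chemotherapy",
                "central nervous system", "central nervous system"]

IDF = idf_table(DOCS)
MODELS = [train_centroids(DOCS, TOPIC_LABELS, IDF),
          train_centroids(DOCS, CLASS_LABELS, IDF)]
print(classify_multidimensional("mechanism of quinine against malaria", MODELS, IDF))
```

Because each dimension has its own model, every class prototype here is built from two documents, whereas a flat model over the four combined classes would have only one document per class.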

VI. EXPERIMENTAL RESULTS

To evaluate our multidimensional model, a drug information collection (DI) is used. DI is a collection of web documents that have been collected from www.rxlist.com. RxList was created by Neil Sandow, a hospital pharmacist. This site contains several drug monographs. This collection is composed of 640 drugs. In this web site, a monograph of a drug is separated into seven topics with seven corresponding documents. Therefore, each document represents one topic of one drug. The topics in monographs (the first dimension) are: adverse drug reaction, clinical pharmacology, description, indications, overdose, patient information, and warnings. The total number of web pages is $640 \times 7 = 4480$. Moreover, we manually grouped the drugs according to a primary therapeutic class, resulting in five classes (the second dimension): chemotherapy, neuro-muscular system, cardiovascular and hematopoietic, hormone, and respiratory system. The MDTC is tested using four classifiers, $k$-NN, NB, TCB1, and TCB2, with the settings described in Section V. All experiments were performed with ten-fold cross validation. That is, a data set is divided into ten equal subsets, and then testing is performed ten times, each time on one subset while the remaining nine subsets are kept as a training set. The performance was measured by classification accuracy defined as the ratio of the number of documents assigned with correct classes to the total number of


TABLE I CLASSIFICATION ACCURACY OF FLAT-BASED APPROACH

test documents. As preprocessing, some stop words (e.g., a, on) and all markup tags were omitted from documents to eliminate the effect of these common words and typographic words.

Two sets of experiments are performed. The first set of experiments investigates the three approaches of MDTC: flat-based, hierarchical-based, and multidimensional-based. We consider perfect classification accuracy, i.e., the cases in which the classes of all dimensions are correctly classified. In the second experiment, we investigate the effect of training size on each classification scheme. Because two factors are involved in this experiment, i.e., classification approaches and classification algorithms, two-way analysis of variance (ANOVA) is used as a statistical method for evaluation. However, the objective of the experiment is to investigate the performance of the three classification approaches, not the performance of the classifiers. The average classification accuracy of the multidimensional approach is compared to that of the other approaches by Scheffé’s test, a method which is suitable for both multiple comparison and range test. Furthermore, the concept of MDTC is applied to a search engine for searching drug information.

A. Flat-Based Approach

In this experiment, test documents are classified into the most specific classes, which are the combinations of the classes of the two dimensions, $D_1$ and $D_2$. Therefore, the number of classes equals the product of the numbers of classes of all dimensions, that is, $7 \times 5 = 35$ classes. A test document was assigned the class that achieved the highest score from the classifier applied. Table I displays the average classification accuracy and standard error of the mean (SEM) of the flat classification. Here, two measures, two-dimensional and single-dimensional accuracy, are taken into account. The two-dimensional accuracy counts a test document as correct only when it is completely assigned to the correct combined class. The single-dimensional accuracies $D_1$ and $D_2$ are the accuracies of the first and second dimensions, where the class in each dimension is read off from the resulting combined class. Note that single-dimensional accuracy can never be lower than two-dimensional accuracy in the same environment.

B. Hierarchical-Based Approach

Since there are two dimensions in the data set, hierarchical-based classification can be carried out in two different ways according to the classifying order. In the first version, documents are classified based on the first dimension to determine the class to

TABLE II CLASSIFICATION ACCURACY OF HIERARCHICAL-BASED APPROACH: $D_1 \to D_2$ (ABOVE) VERSUS $D_2 \to D_1$ (BELOW)

TABLE III CLASSIFICATION ACCURACY OF MULTIDIMENSIONAL APPROACH

which those documents belong. They are then further classified according to the second dimension using the model of that class. The other version classifies documents based on the second dimension first and then the first dimension. The results are shown in Table II, which reports the accuracy of the first dimension ($D_1$), the accuracy of the second dimension ($D_2$), and the two-dimensional accuracy. When the first dimension is classified first ($D_1 \to D_2$), the second-dimension accuracy is obtained using the result from the first dimension while classifying the second dimension. When the second dimension is classified first ($D_2 \to D_1$), the first-dimension accuracy is obtained using the result from the second dimension while classifying the first dimension. From the results, the hierarchical-based approach performs better than the flat-based approach in most cases. Moreover, an interesting observation is that classifying on the worse dimension before the better one yields a better result. Classification on the second dimension seems harder than classification on the first dimension. Therefore, classifying on the second dimension before the first dimension gains better accuracy than classifying on the first before the second.

C. Multidimensional-Based Approach

In this experiment, the multidimensional-based approach is investigated. Documents are classified twice, once for each of the two dimensions, independently. The results of the first and second dimensions are combined to be the suggested class for a test document. The classification accuracy of multidimensional-based classification is shown in Table III. In the table, $D_1$ and $D_2$ mean the accuracy of the first and second dimensions, respectively. The two-dimensional accuracy is obtained from the combination of classes suggested in the first and the second


Fig. 2. Effect of training set size to classification accuracy.

dimensions of the same classifiers. In addition, the two-dimensional result can be combined from different classifiers. To achieve better accuracy, the most accurate classifier for each dimension is used and the results are combined. From the results, the multidimensional-based approach outperforms the flat-based approach and both hierarchical-based approaches ($D_1 \to D_2$ and $D_2 \to D_1$) in all cases. Since TCB1 is the best for one dimension and TCB2 is the best for the other, the maximum accuracy is obtained from the combination of the best classifiers in each dimension. It is possible to use them for these dimensions. Finally, we calculate the accuracy for this case and obtain up to 81.47%. In the hierarchical-based approach, it is hard to predict which pair of classifiers performs the best for classifying in both dimensions.

D. Effect of Training Set Size

The aim of this experiment is to confirm that the multidimensional-based approach is especially useful when the training set is limited. The size of the training set is varied over 20%, 40%, 60%, 80%, and 100%. The average two-dimensional classification accuracy of the four classifiers for the flat-based approach (F), the hierarchical-based approaches beginning with $D_1$ (H1) or $D_2$ (H2), and the multidimensional-based approach (M) is shown in Fig. 2. The difference in the average accuracy of all classifiers among the classification approaches is determined using two-way ANOVA. Scheffé’s test is used for testing the mean difference between the multidimensional-based approach and the others. Two levels of significance are marked: one for a $p$ value less than 0.05 ($p < 0.05$) and one for a $p$ value less than 0.01 ($p < 0.01$). From the graphs, the following observations can be made: 1) the larger the training set is, the higher the accuracy; 2) the order from the best to the worst approach is M, H2, H1, and F; and 3) higher significance is obtained when a smaller training set is used. The results suggest that the multidimensional-based approach outperforms the others regardless of classification algorithm and training set size.
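The accuracy measures used in this section can be made concrete with a small sketch. It computes the single-dimensional accuracies and the two-dimensional accuracy from lists of (first-dimension, second-dimension) labels, and shows how predictions from different classifiers can be joined per dimension. The labels and the two sets of predictions are hypothetical toy data, not the paper's results.

```python
def accuracy_report(true_pairs, predicted_pairs):
    """Return (D1 accuracy, D2 accuracy, two-dimensional accuracy)."""
    n = len(true_pairs)
    pairs = list(zip(true_pairs, predicted_pairs))
    d1 = sum(t[0] == p[0] for t, p in pairs) / n
    d2 = sum(t[1] == p[1] for t, p in pairs) / n
    both = sum(t == p for t, p in pairs) / n  # correct in every dimension
    return d1, d2, both


def combine_dimensions(first_dim_predictions, second_dim_predictions):
    """Join per-dimension predictions that may come from different classifiers."""
    return list(zip(first_dim_predictions, second_dim_predictions))


# Hypothetical labels: topic (P, I, W) and therapeutic class (1, 2).
TRUE = [("P", 1), ("I", 1), ("W", 2), ("P", 2)]
PRED_A = [("P", 1), ("I", 2), ("W", 2), ("I", 1)]  # classifier A, better on D1
PRED_B = [("I", 1), ("I", 1), ("P", 2), ("P", 2)]  # classifier B, better on D2

# Use classifier A for the first dimension and classifier B for the second.
COMBINED = combine_dimensions([p[0] for p in PRED_A], [p[1] for p in PRED_B])

print(accuracy_report(TRUE, PRED_A))
print(accuracy_report(TRUE, PRED_B))
print(accuracy_report(TRUE, COMBINED))
```

In this toy case each classifier alone reaches a two-dimensional accuracy of 0.5, while the per-dimension combination reaches 0.75, which mirrors the idea of taking the best classifier for each dimension. The two-dimensional accuracy is never higher than either single-dimensional accuracy.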

E. Incorporating MDTC in a Search Engine

New drugs and alternative medicine monographs are provided by several online drug information services. As accessible information increases, however, it has become more difficult to obtain the information comfortably and effectively. One solution to this problem is to organize these documents into predefined classes. The benefit of the multidimensional category model is high accuracy with a smaller training set. A small training set arises in particular in drug information when a new pharmacological class appears. It is easy to classify each dimension in parallel; furthermore, the accuracy of all dimensions does not depend on the order in which the dimensions are classified. Exploiting these benefits, we constructed a drug information search engine based on mnoGoSearch, open-source software for constructing search engines. We also embedded our automatic classification module into the system. Users select a category of interest (e.g., Health&Beauty > Pharmacy > Drug Information > Overdose > Cardiovascular System) and continue to search for related documents under that category by typing some keywords (e.g., vasodilators). Fig. 3 shows the result from the system.

In the future, we can also apply the concept of MDTC, together with text summarization, information extraction, and so on, to construct an automatic question-answering system for drug information. This system can classify a question into several dimensions: who asks the question (a patient or a health team); which type of drug interaction is involved (drug-drug interaction or drug-food interaction); and which topic the question focuses on (indication or adverse drug reaction). Moreover, this system is useful for automatically providing more desirable answers to users.

VII. CONCLUSION

Classifying drug information into predefined classes is very useful in a health-care system. While most traditional text classification tasks focus on classification using a single set of classes (flat category model) or a hierarchy of classes (hierarchical category model), this work has introduced a multidimensional category model into drug information. That is, there were multiple classification criteria, in contrast with a single criterion. We explored these three category models by using four classifiers, i.e., $k$-NN, NB, and two CB methods. The results showed that the multidimensional-based approach gains the highest accuracy. The accuracy of these approaches was also evaluated on various training set sizes. We found that with a limited training set size, classification by a multidimensional model outperforms flat and hierarchical models. Moreover, the multidimensional-based approach enables the classification process to occur in parallel, dimension by dimension. This speeds up the process of classifying documents. It is also possible to use the result from the best classifier in each dimension. Given these benefits, the multidimensional model is highly suitable for organizing drug information. Finally, the concept of the multidimensional category model can be applied to a drug information search engine to enhance the searching capability. In our future work, we plan to explore the multidimensional-based approach in environments with more dimensions.

ACKNOWLEDGMENT

The authors would like to thank the mnoGoSearch project for providing the software to speed up development.


Fig. 3. Result from a search engine incorporated with MDTC.

REFERENCES

[1] K. Nigam, A. K. McCallum, S. Thrun, and T. M. Mitchell, “Text classification from labeled and unlabeled documents using EM,” Mach. Learn., vol. 39, no. 2/3, pp. 103–134, 2000.
[2] O.-W. Kwon and J.-H. Lee, “Text categorization based on k-nearest neighbor approach for web site classification,” Inf. Process. Manage., vol. 39, no. 1, pp. 25–44, 2003.
[3] D. Tikk and G. Biró, “Experiment with a hierarchical text categorization method on the WIPO-alpha patent collection,” in Proc. ISUMA-03 4th Int. Symp. Uncertainty Modeling and Analysis, College Park, MD, 2003, pp. 104–109.
[4] D. D. Lewis and M. Ringuette, “A comparison of two learning algorithms for text categorization,” in Proc. SDAIR-94 3rd Annu. Symp. Document Analysis and Information Retrieval, Las Vegas, NV, 1994, pp. 81–93.
[5] Y. Yang, “An evaluation of statistical approaches to text categorization,” Inform. Retrieval, vol. 1, no. 1/2, pp. 69–90, 1999.
[6] C. K. Adam, H. T. Ng, and H. L. Chieu, “Bayesian online classifiers for text classification and filtering,” in Proc. SIGIR-02 25th ACM Int. Conf. Research and Development in Information Retrieval, M. Beaulieu, R. Baeza-Yates, S. H. Myaeng, and K. Järvelin, Eds., Tampere, Finland, 2002, pp. 97–104.
[7] D. D. Lewis, R. E. Schapire, J. P. Callan, and R. Papka, “Training algorithms for linear text classifiers,” in Proc. SIGIR-96 19th ACM Int. Conf. Research and Development in Information Retrieval, H. P. Frei, D. Harman, P. Schauble, and R. Wilkinson, Eds., Zürich, Switzerland, 1996, pp. 298–306.
[8] E.-H. Han and G. Karypis, “Centroid-based document classification: Analysis and experimental results,” Principles Data Mining Knowledge Discovery, pp. 424–431, 2000.
[9] M. E. Ruiz and P. Srinivasan, “Automatic text categorization using neural networks,” in Proc. 8th ASIS/SIGCR Workshop Classification Research, E. Efthimiadis, Ed., Washington, DC, 1997, pp. 59–72.
[10] T. Joachims, “Text categorization with support vector machines: Learning with many relevant features,” in Proc. ECML-98, 10th Eur. Conf. Machine Learning, C. Nedellec and C. Rouveirol, Eds., Chemnitz, Germany, 1998, Paper 1398, pp. 137–142.
[11] S. T. Dumais and H. Chen, “Hierarchical classification of web content,” in Proc. SIGIR-00 23rd ACM Int. Conf. Research and Development in Information Retrieval, N. J. Belkin, P. Ingwersen, and M. K. Leong, Eds., Athens, Greece, 2000, pp. 256–263.
[12] Y. Yang, “An evaluation of statistical approaches to MEDLINE indexing,” in Proc. AMIA-96 Fall Symp. American Medical Informatics Association, J. J. Cimino, Ed., Washington, DC, 1996, pp. 358–362.
[13] B. Ribeiro-Neto, A. H. Laender, and L. R. de Lima, “An experimental study in automatically categorizing medical documents,” J. Amer. Soc. Inform. Sci. Technol., vol. 52, no. 5, pp. 391–401, 2001.
[14] T. Joachims, “A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization,” in Proc. ICML-97 14th Int. Conf. Machine Learning, D. H. Fisher, Ed., Nashville, TN, 1997, pp. 143–151.
[15] A. McCallum, R. Rosenfeld, T. Mitchell, and A. Y. Ng, “Improving text classification by shrinkage in a hierarchy of classes,” in Proc. ICML-98 15th Int. Conf. Machine Learning, San Francisco, CA, 1998, pp. 359–367.
[16] M. Ruiz and P. Srinivasan, “Hierarchical text classification using neural networks,” Inform. Retrieval, vol. 5, no. 1, pp. 87–118, 2002.
[17] S. T. Johnson and C. J. Wordell, “Internet utilization among medical information specialists in the pharmaceutical industry and academic,” Drug Inform. J., vol. 32, pp. 547–554, 1998.
[18] C. F. Curran and M. M. Buxton, “Streamlining the information retrieval process in the drug information department,” Drug Inform. J., vol. 35, pp. 921–933, 2001.
[19] S. R. Mccreadie, J. L. Stumpt, and T. D. Benner, “Building a better online formulary,” Amer. J. Health-System Pharmacists, vol. 59, no. 19, pp. 1847–1852, 2002.
[20] V. Lertnattee and T. Theeramunkong, “Effect of term distributions on centroid-based text categorization,” Inform. Sci., vol. 158, pp. 89–115, 2004.

Verayuth Lertnattee received the Bachelor’s degree in pharmacy, and the Master’s degree in science (computer science) from Chulalongkorn University, Bangkok, Thailand, in 1989 and 1996, respectively. He is currently working toward the Ph.D. degree in the Information Technology Program, SIIT, Thammasat University, Pathumthani, Thailand. His research interests include data mining in medical and pharmaceutical information.

Thanaruk Theeramunkong received the Bachelor’s degree in electric and electronics, and the Master’s and the doctoral degrees in computer science from Tokyo Institute of Technology, Tokyo, Japan, in 1990, 1992, and 1995, respectively. Working at SIIT, Thammasat University, Pathumthani, Thailand, his current research interests include data mining, machine learning, natural language processing, information retrieval, and knowledge engineering.
