A Binary-Categorization Approach for Classifying Multiple-Record Web Documents Using Application Ontologies and a Probabilistic Model

Yiu-Kai Ng, June Tang, Michael Goodrich
Computer Science Department, Brigham Young University, Provo, Utah 84602, U.S.A.
Email: {ng, junet, mike}@cs.byu.edu

Abstract

its inability to produce ranked outputs. The vector space model (VSM), on the other hand, ranks documents by using a similarity matching strategy [9]. VSM ranks documents according to the values of a similarity measure between each document in a collection and a given query. These similarity measures reflect the degree of relevance of each document in the collection to the given query. Since traditional VSMs cannot handle dependencies among index terms, they suffer from the problem of oversimplification.

Another IR model, the probabilistic model, is an adaptive model based on Bayes' decision theory. The simplest probabilistic IR model is the so-called binary independence retrieval (BIR) model [11]. In this model, one assumes that each document is described by the presence or absence of a designated set of index terms extracted from a query, and hence each document is represented by a binary vector $z = (z_1, \ldots, z_n)$, where $z_i = 0$ (or 1) indicates the absence (or presence) of the $i$th ($1 \le i \le n$) index term in the document. A collection of documents is ranked in decreasing order of probability of relevance to the query. In general, it is impossible to calculate the probability of relevance exactly, because the number of variables involved in representing documents is large compared with the small amount of feedback data available about the relevance of documents [11]. Thus, BIR in its naive form is rarely applied. The probability of relevance can, however, be estimated under certain assumptions on the independence of terms. Under the assumption that all terms are mutually stochastically independent, a ranking function (or discrimination function) [11], which is also called the retrieval status value [4], can be obtained. Assumptions that only pairs, triplets, quadruples, etc., of terms are independent of each other have also been studied [11].
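To make the ranking under the full independence assumption concrete, the following is a minimal sketch of a BIR-style retrieval status value in its common textbook log-odds form, which may differ in detail from the formulation in [11] or [4]. The function names, the per-term probabilities `p` and `q`, and the four toy documents are hypothetical and chosen only for illustration.

```python
import math

def bir_term_weights(p, q):
    """Log-odds weight of each index term.

    p[i]: assumed probability that term i is present in a relevant document
    q[i]: assumed probability that term i is present in a non-relevant document
    Both are illustrative inputs and must lie strictly between 0 and 1.
    """
    return [math.log(pi * (1 - qi) / (qi * (1 - pi))) for pi, qi in zip(p, q)]

def retrieval_status_value(z, weights):
    """Sum the weights of the terms present in the binary vector z."""
    return sum(w for zi, w in zip(z, weights) if zi == 1)

def rank_documents(docs, p, q):
    """Return (name, score) pairs sorted by decreasing retrieval status value."""
    weights = bir_term_weights(p, q)
    scored = [(name, retrieval_status_value(z, weights)) for name, z in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical estimates for three index terms.
p = [0.8, 0.6, 0.5]
q = [0.2, 0.3, 0.5]

# Hypothetical documents as binary vectors z = (z_1, z_2, z_3).
docs = {
    "d1": [1, 1, 0],
    "d2": [0, 1, 1],
    "d3": [0, 0, 0],
    "d4": [1, 0, 1],
}

ranking = rank_documents(docs, p, q)
for name, score in ranking:
    print(f"{name}: {score:.4f}")
```

In this toy setup the third term is equally likely in relevant and non-relevant documents, so its weight is zero and it does not affect the ranking. In practice `p` and `q` would have to be estimated from relevance feedback, which is where the estimation errors discussed in the text arise.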
However, experimental evaluations have shown that the gain from these independence assumptions does not outweigh the loss from increased estimation errors [4]. Furthermore, the representation of documents in these BIR-based models is rather

The amount of information available on the World Wide Web has been increasing dramatically in recent years. To speed up searching for and retrieving Web documents of interest, researchers and practitioners have partially relied on various information retrieval techniques. In this paper, we propose a probabilistic model to classify Web documents into relevant documents and irrelevant documents with respect to a particular application ontology, which is a conceptual-model snippet of standard ontologies. Our probabilistic model is based on multivariate statistical analysis and is different from the conventional probabilistic information retrieval models. The experiments we have conducted on a set of representative Web documents indicate that the proposed probabilistic model is promising for the binary categorization of multiple-record Web documents.
