Theseus: Categorization by Context
Giuseppe Attardi, Dipartimento di Informatica, Università di Pisa, Italy [email protected]
1. Introduction
The traditional approach to document categorization is categorization by content, since the information for categorizing a document is extracted from the document itself. In a hypertext environment like the Web, the structure of documents and the link topology can be exploited to perform what we call categorization by context [Attardi 98]: the context surrounding a link in an HTML document is used for categorizing the document referred to by the link. Categorization by context can also deal with multimedia material, since it does not rely on the ability to analyze the content of documents.
Categorization by context leverages the categorization activity implicitly performed when someone places or refers to a document on the Web. By focusing the analysis on the documents used by a group of people, one can build a catalogue tuned to the needs of that group.
Categorization by context is based on the following assumptions:
1. a Web page which refers to a document must contain enough hints about its content to suggest reading it
2. such hints are sufficient to classify the document.
The classification task must be capable of identifying such hints. One obvious hint is just the anchor text for the link, but additional hints may be present elsewhere in a page: page title, section headers, list descriptions, etc. All these hints make up the context for the link. Categorization by context exploits both the structure of Web documents and Web link topology to determine the context of a link. Such context is then used to classify the document referred to by the link.
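The notion of a link's context can be illustrated with a minimal sketch. It assumes Python's standard html.parser module and gathers three of the hints named above for each link: the page title, the most recent section header, and the anchor text. The class name LinkContextCollector, the example page and URL, and the choice of which hints to keep are hypothetical, chosen for illustration only; this is not the Theseus implementation.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class LinkContextCollector(HTMLParser):
    """Illustrative only: records title, latest heading and anchor text per link."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.heading = ""
        self.contexts = {}    # URL -> list of context phrases
        self._capture = None  # tag whose text is currently being captured
        self._buffer = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "title" or tag in HEADINGS:
            self._capture, self._buffer = tag, []
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:  # anchors without href are not links to a document
                self._capture, self._buffer, self._href = "a", [], href

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._capture:
            return
        text = " ".join("".join(self._buffer).split())  # normalize whitespace
        if tag == "title":
            self.title = text
        elif tag in HEADINGS:
            self.heading = text
        else:  # closing </a>: store the hints surrounding this link
            hints = [h for h in (self.title, self.heading, text) if h]
            self.contexts.setdefault(self._href, []).extend(hints)
            self._href = None
        self._capture = None


page = """<html><head><title>Web Search Tools</title></head><body>
<h2>Search Engines</h2>
<ul><li><a href="http://example.org/engine">A fast full-text engine</a></li></ul>
</body></html>"""

collector = LinkContextCollector()
collector.feed(page)
collector.close()
print(collector.contexts)
```

For the example page, the single link is associated with the phrases "Web Search Tools", "Search Engines" and "A fast full-text engine". A classifier working by context would use these phrases, rather than the content of the linked document, as its input. The sketch ignores nested cases such as a link inside a header, which a real analyzer would need to handle.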
Fabrizio Sebastiani, Istituto di Elaborazione dell’Informazione, Pisa, Italy [email protected]
2. Architecture
The overall architecture of our system is described in the following figure.
[Figure: system architecture. Labels: Site List; Spidering and HTML Structure Analysis; URL Context Path (URL: C1: C2: … : Cn); Category Tree; Categorization (URL: N1=w1, N2=w2, … , Nn=wn); Weight combination; Catalog]
2.1 Spidering and HTML Structure Analysis
This task starts from a list of URLs, retrieves each document and analyzes its HTML structure. Whenever a structural HTML tag is found, a context phrase is recorded, e.g. the title within a <TITLE> </TITLE> pair, or the first portion of text after a