Learning to Recognize Webpage Genres
IOANNIS KANARIS & EFSTATHIOS STAMATATOS

Dept. of Information and Communication Systems Eng., University of the Aegean, Karlovassi, Samos – 83200, Greece
{kanaris.i; stamatatos}@aegean.gr

Abstract

Webpages are mainly distinguished by their topic (e.g., politics, sports, etc.) and genre (e.g., blogs, homepages, e-shops, etc.). Automatic detection of webpage genre could considerably enhance the ability of modern search engines to focus on the requirements of the user's information need. In this paper, we present an approach to webpage genre detection based on a fully-automated extraction of the feature set that represents the style of webpages. The features we propose (character n-grams of variable length and HTML tags) are language-independent and easily extracted, and they can be adapted to the properties of the still-evolving web genres and the noisy environment of the web. Experiments based on two publicly-available corpora show that the performance of the proposed approach is superior to previously reported results. It is also shown that character n-grams are better features than words when the dimensionality increases, while the binary representation is more effective than the term-frequency representation for both feature types. Moreover, we perform a series of cross-check experiments (e.g., training on one genre palette and testing on a different genre palette, as well as using the features extracted from one corpus to discriminate the genres of the other corpus) to illustrate the robustness of our approach and its ability to capture the general stylistic properties of genre categories even when the feature set is not optimized for the given corpus.

Keywords: Web genre classification; Webpage representation; Character n-grams

1. INTRODUCTION

Topic and genre are the main factors characterizing a document. The genre of a document relates closely to a distinctive type of discourse; that is, documents of the same genre share a set of communicative characteristics (purpose, form of transmission, etc.) (Swales, 1990). As a result, documents of the same genre have stylistic (rather than thematic) similarities. Unfortunately, there is no agreed-upon definition of genre, no consensus about the complete set of different genres, and no established hierarchy of genres. Another confusing factor is the definition of appropriate genre labels, since people coming from different communities tend to describe a document in different terms. Nowadays, webpages form new kinds of genres, taking advantage of a new communication medium and a different type of interaction with the receiver.
Some of the webpage genres are variants of traditional genres (e.g., online newspapers), while others have no antecedents in paper documents (e.g., personal home pages, blogs) (Shepherd & Watters, 1998).

So far, search engines support topic-based queries. There is only limited control over the genre of the results (e.g., Google allows different searches for the web, discussion forums, and news). Automatic webpage genre identification could significantly improve the ability of information retrieval systems by suggesting queries that better describe the user's information need or by facilitating intuitive navigation through search results. In more detail, the contribution of genre-specific information would be twofold:

• In addition to topic-related terms, users could be able to specify the type of webpage they are looking for (e.g., blog, homepage, e-shop, etc.) (Rosso, 2005).
• Alternatively, the search results for a topic-related set of keywords could be grouped according to their genre in addition or in parallel to thematic clustering (Bekkerman, 2008), filtered according to genre (Meyer zu Eissen, 2007), or ranked based on topic and genre relevance (Braslavski, 2007; Vidulin, et al., 2007), so that users can easily navigate through them and find the type of information that matches their interests.

Additionally, genre detection could also assist metadata extraction from XML repositories to enhance information management in digital libraries (Clark & Watt, 2007).

Usually, webpage genre identification is viewed as a single-label classification task (Meyer zu Eissen & Stein, 2004; Lim, et al., 2005; Kennedy & Shepherd, 2005). However, it is not rare for multiple genres to co-exist in a single page. That is, a webpage can be decomposed into parts, each of which belongs to a different genre (e.g., a part of an e-shop page may display search results or news). This fact has raised the question of treating webpage genre detection as a multi-label classification task, so that a webpage can be an e-shop and a search page at the same time (Santini, 2008). However, since the parts of webpages belonging to different genres are visually separated rather than arbitrarily mixed, a webpage segmentation tool could detect the main areas of the webpages based on stylistic coherence and, then, a single-label webpage genre identification tool could be applied to each part separately. Another difficulty is that the same webpage (or webpage section) can be described with different labels (e.g., home page, informational, entry page, etc.). This situation could also be handled by a set of single-label genre detectors trained on different genre palettes to provide these multiple labels.

Studies on webpage genre identification focus on the definition of appropriate textual features that quantify the stylistic properties of genres.


Following the practice of similar works on text-genre detection (Kessler, et al., 1997; Stamatatos, et al., 2000), topic-neutral features can be defined on the word level (e.g., 'function' words, closed-class words, etc.) or be extracted after appropriate text processing, e.g., part-of-speech frequencies (Dewdney, et al., 2001). Apart from textual information, structural (HTML-based) information is usually taken into account. To this end, different sets of manually defined HTML tags have been proposed (Meyer zu Eissen & Stein, 2004; Boese & Howe, 2005; Santini, 2007). In most cases, the combination of content and structure features improves the results.

The vast majority of previous approaches to webpage genre detection are based on feature sets of small size, comprising features selected manually or by applying a feature selection algorithm (Meyer zu Eissen & Stein, 2004; Lim, et al., 2005; Kennedy & Shepherd, 2005; Santini, 2007; Levering, et al., 2008). However, since the web is a rapidly changing and noisy environment, the features of an efficient webpage genre classifier should be able to adapt to the current conditions of webpages. In this way, the properties of the still-evolving web genres would be better captured. A manually-defined feature set requires effort to be adapted to new genres or to capture modified genre properties. On the other hand, features extracted by a feature selection algorithm are optimized for specific corpora or genre palettes; hence, they are not exportable to other corpora or different genres. Moreover, many features used to represent textual information (e.g., part-of-speech frequencies) require the application of natural language processing tools, and are therefore language- and domain-dependent.

In this paper, we propose a fully-automated approach that is able to extract the most appropriate low-level features from a given corpus. To this end, we explore the use of character n-grams for representing webpage genres. Character n-grams (sequences of n contiguous characters) are able to capture nuances of style, including lexical and syntactic information. They are language-independent, easily extracted from any text, and robust to noisy texts. Note that certain webpage genres (e.g., blogs) are very noisy and include irregular use of punctuation and spelling errors. The character n-gram representation has also been applied to authorship identification (Keselj, et al., 2003; Stamatatos, 2006), another type of style-based classification task, with promising results. Based on an existing approach to n-gram feature selection (Houvardas & Stamatatos, 2006), we produce a variable-length character n-gram representation suitable for the task of webpage genre identification. Moreover, we compare the character n-gram model with a traditional word-based approach of comparable dimensionality, examining both binary and term-frequency (TF) features. In addition to the textual information, we propose a method for automating the extraction of structural information based on HTML tags.
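To make the two central notions of this paragraph concrete (character n-grams and binary vs. term-frequency values), the short Python sketch below is our illustration rather than the authors' code: it lists the character 3-grams of a toy string and shows the two kinds of feature values side by side.

```python
from collections import Counter

text = "the theatre"
# Character 3-grams: overlapping sequences of 3 contiguous characters.
trigrams = [text[i:i + 3] for i in range(len(text) - 2)]

tf = Counter(trigrams)          # term-frequency (TF) representation
binary = {g: 1 for g in tf}     # binary (presence/absence) representation

print(trigrams)
print(tf["the"], binary["the"])  # "the" occurs twice, but is marked 1 in the binary scheme
```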


A crucial factor for estimating the applicability of the proposed approaches is objective evaluation. Unfortunately, the majority of previous studies do not provide a reliable comparison with other approaches. The main reason for this is that, until recently, there were no publicly-available corpora for this task. Another reason is that each study focuses on a different set of genres (Meyer zu Eissen & Stein, 2004; Lim, et al., 2005; Finn & Kushmerick, 2006) or on a specific genre and its sub-genres (e.g., home pages) (Kennedy & Shepherd, 2005). The proposed models are objectively evaluated on two corpora already used for webpage genre identification. Hence, we are able to compare our models with previous work based on two different genre palettes and similar evaluation techniques. A series of experiments shows that our approach is quite effective and significantly improves on the best reported results. Moreover, given that our approach requires high dimensionality and the evaluation corpora are small, we introduce a series of cross-check experiments to exhibit the credibility and robustness of our approach in cases where the feature set is not optimized for a particular corpus or, more interestingly, where a model trained on a specific genre palette is applied to another genre palette.

The rest of this paper is organized as follows. Section 2 gives an overview of previous work on webpage genre identification. Section 3 describes the extraction of character n-gram features as well as the extraction of structural information. The corpora used in this study and the results of the experiments are presented in Section 4, while Section 5 discusses the results and provides a comparison to previous work on the same corpora. Finally, Section 6 summarizes the main points of this paper and proposes future work directions.

2. PREVIOUS WORK

The following review of previous work is presented in chronological order. Moreover, this review deals with research related to webpage genre classification, rather than text genre classification. Each study can be described in terms of three factors, namely:

• The genre palette (the set of different genres),
• The feature set used to represent the content and structure of webpages, and
• The classification algorithm used to identify the most likely genre.

Usually, the classification itself is performed by well-known statistical and machine learning algorithms (e.g., discriminant analysis, neural networks, memory-based learners, support vector machines, etc.) that have been thoroughly studied in the field of text categorization research (Sebastiani, 2002). Hence, in the rest of this section we will focus on the first two factors.


Lee and Myaeng (2002) were the first to use web documents in genre detection. However, their aim was to investigate whether genre-related features could help topical text classification, rather than to study the webpage genre detection problem. As a result, their genre palette was a mix of text genres and webpage genres (reportage, editorial, technical paper, critical review, personal homepage, Q&A, and product specification). The web genre palettes used so far have been either general, so as to cover a large part of the web, or specific, so as to examine the properties of certain genres. Meyer zu Eissen and Stein (2004) provided a general palette of eight webpage genres (see KI-04 in Table 1) following a user study on web genre usefulness. Lim, et al. (2005) used a corpus of 15 genres (personal home page, public home page, commercial home page, bulletin collection, link collection, image collection, simple table/lists, input pages, journalistic material, research report, official materials, FAQs, discussions, product specification, informal texts). Santini (2007) reports another general palette comprising seven genres (see 7genre in Table 1). Kim and Ross (2007) enriched this collection with 24 general document genres (populated by PDF web documents). In another study, Vidulin, et al. (2007) describe a collection of 20 genre labels (see Table 13), allowing several labels to be assigned to one webpage. Although there are certain similarities in such general genre palettes (e.g., personal home page), it is not clear how each genre of one palette is associated with the genres of another palette.

Concerning more specific genre palettes, Kennedy and Shepherd (2005) emphasized the discrimination between home pages and non-home pages. On a second level, they classified home pages into three categories (personal, corporate, and organization), with the identification of personal home pages being the most effective. Finn and Kushmerick (2006) investigated two genres, articles and reviews, and examined whether an article is subjective or objective and whether a review is positive or negative. However, these classes better match the task of sentiment analysis than the task of genre identification. Levering, et al. (2008) used four related genres (store home pages, store product lists, store product descriptions, and other store pages). Kim and Ross (2008) focused on PDF web documents and a collection of six genres (academic monograph, business report, book of fiction, minutes, periodicals, and thesis). Such focused genre palettes provide a well-defined testing ground for genre identification, but in a limited scope. Moreover, the conclusions drawn from experiments on a focused genre palette may be misleading, since features found to be useful in some genres are not equally effective in other genres (Boese and Howe, 2005).


Usually, the feature set of a webpage genre detection study is a combination of different features. To represent the plain-text information, a wide variety of features have been proposed, following the practice of other text categorization tasks, including term frequencies (or a bag-of-words representation) (Boese and Howe, 2005; Finn and Kushmerick, 2006), common (or 'function') word frequencies (Kennedy and Shepherd, 2005; Santini, 2007; Levering, et al., 2008), genre-prolific words (Kim and Ross, 2008), and frequencies of certain symbols, e.g., punctuation marks (Meyer zu Eissen and Stein, 2004; Santini, 2007). Such features are quite easy to compute given the availability of a tokenizer and perhaps a stemmer (Dong, et al., 2008). However, for many natural languages (e.g., Chinese, Japanese) the tokenization task is especially difficult. Additionally, more elaborate features requiring some kind of text analysis include sentence length (Finn and Kushmerick, 2006), part-of-speech frequencies (Meyer zu Eissen and Stein, 2004; Lim, et al., 2005; Boese and Howe, 2005; Finn and Kushmerick, 2006; Santini, 2007; Levering, et al., 2008), part-of-speech n-grams (Santini, 2007), named-entity (e.g., names, dates) frequencies (Meyer zu Eissen and Stein, 2004; Lim, et al., 2005), phrase length (Lim, et al., 2005), sentence types (Lim, et al., 2005), etc. These syntax-related features depend on the availability of appropriate natural language processing tools for a specific language or domain. Such tools introduce noise in the estimation of the feature values, because they are still imperfect. On the other hand, useful stylistic features may become available as a side-effect of automated text analysis. For example, Lim, et al. (2005) propose the average number of syntactic trees per sentence produced by a syntactic analyzer as a means to quantify syntactic ambiguity.

A great variety of features has also been proposed to represent the structure (i.e., presentation or layout) of webpages, such as HTML tag frequencies, metatag frequencies, script features (e.g., use of JavaScript), link analysis features (e.g., number of external links, number of email links), URL-related features (e.g., URL depth), form analysis features (e.g., number of input forms), and image counts (Meyer zu Eissen and Stein, 2004; Lim, et al., 2005; Kennedy and Shepherd, 2005; Boese and Howe, 2005; Santini, 2007; Levering, et al., 2008; Dong, et al., 2008). In general, such measures can be easily extracted from webpages and provide useful information on their stylistic properties. However, they cannot be applied to a significant part of non-HTML web documents (e.g., PDF, DOC, PPT files). Moreover, so far these features have had to be defined manually.

Another source of information comes from the various webpage objects and their position. Lim, et al. (2005) focused on the usefulness of information in different parts of webpages (title, body, anchor text, etc.) and found that the main body and anchor text parts provide the most effective features.


Kim and Ross (2008) propose a transformation of the first page of PDF documents to a bit map in order to identify the text regions of the page. In another recent study, Levering, et al. (2008) propose the use of visual features that capture the layout characteristics of the genres. The visual features attempt to represent the rendered location and size of webpage objects. Their experiments showed that visual features improve the classification results.

Some approaches to webpage genre detection use an arbitrarily-defined feature set, usually of small size, comprising a few dozen features (Meyer zu Eissen & Stein, 2004; Santini, 2007). Therefore, such approaches require a manual update of the feature set to be able to adapt to the current conditions of webpage genres. Since the web is a rapidly changing environment, this procedure is of vital importance. Alternatively, other approaches begin with a large feature set comprising hundreds or thousands of features, and then a feature selection algorithm is used to extract the most useful features (Lim, et al., 2005; Boese & Howe, 2005; Kennedy and Shepherd, 2005; Levering, et al., 2008). In that case, the resulting feature set is optimized for a particular corpus and a particular genre palette, since features are selected according to their discriminatory ability. Hence, it is not effective to apply a feature set extracted from one corpus to another corpus. What is needed is a general and fully-automated methodology for extracting stylistic features that does not depend on certain corpora, genres, or natural languages. As a result, such a feature set would be exportable, meaning that it could be extracted from one corpus of a specific genre palette and effectively be used for another corpus, possibly of a different genre palette. The methodology presented in the next section aims in this direction.

3. OUR APPROACH

In order to represent the content of webpages we use two sources of information: textual and structural. The procedure for extracting this information is described in the following subsections.

3.1. Character n-grams of Variable Length

To extract information from the textual content of the webpages, we first remove any HTML-based information. Then, we use character n-grams to quantify the textual information. Houvardas and Stamatatos (2006) propose a feature selection method for character n-grams of variable length aimed at authorship identification. In this study, we use this approach in order to represent webpages as a 'bag of character n-grams'. Starting from a large initial set of variable-length n-grams (e.g., composed of equal amounts of fixed-length n-grams), the main idea is to compare each n-gram with similar n-grams (either immediately longer or shorter) and keep the dominant n-grams based on their frequency of occurrence. All the other n-grams are discarded and do not contribute to the classification model.

To this end, we need a function able to express the significance of an n-gram. We view this function as 'glue' that sticks the characters together within an n-gram. The higher the glue of an n-gram, the more likely it is to be included in the set of dominant n-grams. For example, the 'glue' of the n-gram |the_| will be higher (more frequent, thus more probable) than the 'glue' of the n-gram |thea| (less probable); here, '|' and '_' denote n-gram boundaries and a single space character, respectively.

It has to be underlined that the method described in this section for extracting dominant character n-grams is frequency-based. Class information is not used for selecting the features (i.e., the discriminatory ability of character n-grams is not taken into account). This is in contrast to many feature selection algorithms, such as information gain or odds ratio (Forman, 2003). As a result, the features extracted by the proposed algorithm from a given corpus are portable to other corpora based on a different genre palette.

3.1.1. Extraction of dominant n-grams

To extract the dominant character n-grams in a corpus, we modified the LocalMaxs algorithm, originally introduced by Silva and Lopes (1999) for extracting multiword terms (i.e., word n-grams of variable length) from texts. It is an algorithm that computes local maxima by comparing each n-gram with similar (shorter or longer) n-grams. Given that:

• g(C) is the glue of n-gram C, that is, the power holding its characters together,
• ant(C) is an antecedent of an n-gram C, that is, a shorter string composed of n−1 consecutive characters of C, and
• succ(C) is a successor of C, that is, a longer string of size n+1, i.e., composed of C and one extra character on either the left or the right side of C,

the dominant n-grams are selected according to the following rules:

\[
\begin{aligned}
&\text{if } C.\text{length} > 3:\quad g(C) \ge g(\mathrm{ant}(C)) \;\wedge\; g(C) > g(\mathrm{succ}(C)), \quad \forall\, \mathrm{ant}(C),\, \mathrm{succ}(C) \\
&\text{if } C.\text{length} = 3:\quad g(C) > g(\mathrm{succ}(C)), \quad \forall\, \mathrm{succ}(C)
\end{aligned}
\tag{1}
\]

In this study, we only consider 3-grams, 4-grams, and 5-grams as candidate n-grams, since they can capture both sub-word and inter-word information while keeping the dimensionality of the problem at a reasonable level. Note that, according to the proposed algorithm, 3-grams are only compared with successor n-grams, whereas 5-grams are only compared with antecedent n-grams. So, it is expected that the proposed algorithm will favor 3-grams and 5-grams against 4-grams.
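The selection rule of Eq. (1) is straightforward to implement once a glue function is available. The following Python sketch is ours rather than the authors' code: the corpus is simplified to a single string, the glue is passed in as a parameter (the paper plugs in fairSCP, defined in Section 3.1.2), and a plain frequency-based glue is used only to keep the example self-contained.

```python
from collections import Counter

def candidate_ngrams(text, sizes=(3, 4, 5)):
    """Count all fixed-length character n-grams (3-, 4-, and 5-grams) in a text."""
    counts = Counter()
    for n in sizes:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

def dominant_ngrams(counts, glue):
    """Apply the LocalMaxs-style rule of Eq. (1) to keep locally dominant n-grams."""
    # Index successors: observed n-grams of length n+1 that extend a given n-gram.
    successors = {}
    for g in counts:
        for shorter in (g[:-1], g[1:]):
            successors.setdefault(shorter, []).append(g)

    dominant = []
    for g in counts:
        # Every candidate must strictly beat all of its observed successors.
        if any(glue(g) <= glue(s) for s in successors.get(g, [])):
            continue
        # Candidates longer than 3 must also be at least as glued as their antecedents.
        if len(g) > 3 and any(glue(g) < glue(a) for a in (g[:-1], g[1:]) if a in counts):
            continue
        dominant.append(g)
    return dominant

# Toy usage with a purely frequency-based glue; the paper uses fairSCP instead.
counts = candidate_ngrams("the theatre near the thermal baths")
print(dominant_ngrams(counts, glue=lambda g: counts[g]))
```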



3.1.2. Representing the glue

To measure the glue holding the characters of an n-gram together, we adopt the Symmetrical Conditional Probability (SCP) proposed by Silva, et al. (1999). The SCP of a bigram |xy| is the product of the conditional probabilities of each character given the other:

\[
\mathrm{SCP}(x, y) = p(x \mid y) \cdot p(y \mid x) = \frac{p(x, y)}{p(y)} \cdot \frac{p(x, y)}{p(x)} = \frac{p(x, y)^2}{p(x) \cdot p(y)}
\tag{2}
\]

Given a character n-gram |c1 … cn|, a dispersion point defines two subparts of the n-gram. Hence, an n-gram of length n contains n−1 possible dispersion points (e.g., if * denotes a dispersion point, then the 3-gram |the| has two dispersion points: |t*he| and |th*e|). Then, the SCP of the n-gram |c1 … cn| given the dispersion point |c1 … cn−1 * cn| is:

\[
\mathrm{SCP}\big((c_1 \dots c_{n-1}),\, c_n\big) = \frac{p(c_1 \dots c_n)^2}{p(c_1 \dots c_{n-1}) \cdot p(c_n)}
\tag{3}
\]

Essentially, this is used to suggest whether the n-gram is more important than the two substrings defined by the dispersion point. The lower the SCP, the less important the initial n-gram. The SCP measure can easily be extended to account for any possible dispersion point; since this measure is based on fair dispersion point normalization, it will be called fairSCP. Hence, the fairSCP of the n-gram |c1 … cn| is as follows:

\[
\mathrm{fairSCP}(c_1 \dots c_n) = \frac{p(c_1 \dots c_n)^2}{\dfrac{1}{n-1} \displaystyle\sum_{i=1}^{n-1} p(c_1 \dots c_i) \cdot p(c_{i+1} \dots c_n)}
\tag{4}
\]
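A compact way to compute fairSCP from raw text is sketched below. This is our illustration, not the authors' code; in particular, it estimates each probability p(·) as the relative frequency of the substring among all character n-grams of the same length, which is an assumption the paper does not spell out.

```python
from collections import Counter

def ngram_probs(text, max_n=5):
    """Relative-frequency estimates p(s) for all character n-grams up to length max_n."""
    counts, totals = Counter(), Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
            totals[n] += 1
    return lambda s: counts[s] / totals[len(s)] if totals[len(s)] else 0.0

def fair_scp(ngram, p):
    """fairSCP of Eq. (4): p(c1..cn)^2 over the average of all dispersion-point products."""
    n = len(ngram)
    denom = sum(p(ngram[:i]) * p(ngram[i:]) for i in range(1, n)) / (n - 1)
    return p(ngram) ** 2 / denom if denom else 0.0

p = ngram_probs("the theatre near the thermal baths")
print(fair_scp("the", p), fair_scp("hea", p))  # the frequent |the| should outscore |hea|
```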

3.2. Structural Information

The structural information of a webpage is encoded in the HTML tags used in the page. To quantify this structural information, we use a BOW-like approach based on HTML tags instead of words. That is, we first extract the list of all the HTML tags that appear at least three times in the entire collection of webpages and represent each webpage as a vector of the frequencies of appearance of these HTML tags in that page.

Then, the ReliefF algorithm for feature selection (Robnik-Sikonja & Kononenko, 2003) is applied to the produced dataset to reduce the dimensionality. This algorithm is able to detect conditional dependencies between attributes and is noise tolerant. In this way, we avoid any manual definition of useful HTML tags and the whole procedure is fully automated. In addition, the extracted HTML tags are adapted to the particular properties of a specific corpus.
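The tag-vector construction can be sketched in a few lines of Python. The regex-based tag extraction, the variable names, and the delegation of ReliefF to an external implementation are our assumptions, not the authors' implementation.

```python
import re
from collections import Counter

TAG_RE = re.compile(r"<\s*([a-zA-Z][a-zA-Z0-9]*)")

def tag_counts(html):
    """Frequencies of the opening HTML tag names found in one page."""
    return Counter(t.lower() for t in TAG_RE.findall(html))

def build_tag_features(pages, min_total=3):
    """Bag-of-tags vectors, keeping only tags that appear at least min_total times overall."""
    per_page = [tag_counts(p) for p in pages]
    total = Counter()
    for c in per_page:
        total.update(c)
    vocab = sorted(t for t, n in total.items() if n >= min_total)
    return vocab, [[c[t] for t in vocab] for c in per_page]

# The resulting frequency matrix is then handed to a ReliefF implementation
# (any available one) to rank the tags and keep only the most relevant ones.
```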


4. EXPERIMENTS

4.1 Webpage Genre Corpora

Although there is not yet a large reference corpus covering a wide variety of web genres, several small webpage corpora appropriate for evaluating genre identification approaches have become available (Rehm, et al., 2008). Despite significant differences in the methodologies used to define the genres and to populate the genre categories with webpages, such publicly-available corpora provide an objective testing ground for comparing different approaches. In this paper, we make use of two corpora already used in several previous studies (both downloaded from http://www.nltg.brighton.ac.uk/home/Marina.Santini/). Details about these corpora can be found in Table 1.

• 7genre: This corpus was built in early 2005 and consists of 1,400 English webpages categorized into 7 genres following the criteria of 'annotation by objective sources' and 'consistent genre granularity' (with the exception of the LISTING genre, which may be decomposed into many sub-genres) (Santini, 2007). This is a balanced corpus, meaning that the webpages are equally distributed among the genres.
• KI-04: This corpus was built in 2004 and comprises 1,205 English webpages classified under 8 genres. These genres were suggested by a user study on genre usefulness (Meyer zu Eissen & Stein, 2004). The distribution of webpages across genres is not balanced.

Note that some genres included in the aforementioned corpora are very similar (e.g., PERSONAL HOME PAGE and E-SHOP of 7genre are quite similar to PORTRAYAL-PRIVATE and SHOP of KI-04, respectively). However, the exact relationships between all the genres defined in these two corpora are not clearly defined. As already mentioned, several researchers have used these corpora to evaluate their methods (Meyer zu Eissen & Stein, 2004; Boese & Howe, 2005; Santini, 2007; Kim & Ross, 2007; Mason, et al., 2009). Dong, et al. (2008) used a subset of three genres from 7genre. The results reported so far based on cross-validation are summarized in Table 2.



4.2 Words vs. Character n-grams

In the first experiment, we compare the performance of the variable-length n-gram approach with a traditional bag-of-words method. In more detail, for the character n-gram approach, the initial set of features comprised equal amounts of three fixed-length character n-gram subsets (e.g., an initial feature set of 3,000 features includes the 1,000 most frequent character 3-grams, the 1,000 most frequent character 4-grams, and the 1,000 most frequent character 5-grams). The initial feature sets ranged from 3,000 up to 24,000 n-grams, with a step of 3,000 features (i.e., 1,000 from each of the three fixed-length n-gram subsets). Then, the feature selection approach described in Section 3.1 was applied to the initial set, and the resulting feature sets of variable-length character n-grams were used to represent the webpages of each corpus. Finally, a Support Vector Machine (SVM), a machine learning algorithm able to deal with high-dimensional and sparse data (Joachims, 1998), was used for the classification. As for the word-based approach, the most frequent words of the corpus in question formed the set of features. Then, an SVM model was applied to build the webpage genre classifier. We examined binary as well as TF features for both schemes (i.e., character n-grams and words).

Figures 1 and 2 show the classification accuracy results (based on stratified 10-fold cross-validation) on the 7genre and KI-04 corpora, respectively, for various feature set sizes. As can be seen, the binary features are better than the corresponding TF features in all cases. Moreover, the word features are better when the dimensionality remains low. This can be explained by the fact that word features are more semantically loaded than character n-grams. However, the best results are obtained by binary character n-grams when the dimensionality exceeds 4,000 features. In other words, character n-grams require higher dimensionality than word features but provide better accuracy. Tables 3 and 4 present the confusion matrices for the 7genre corpus based on binary character n-grams of variable length and the word-based model, respectively. Similarly, Tables 5 and 6 present the corresponding confusion matrices for KI-04.
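For readers who want to reproduce the general setup, the following scikit-learn sketch mirrors the experimental protocol in spirit only: it is our code, it replaces the variable-length n-gram selection of Section 3.1 with a simple most-frequent cut-off, and it uses LinearSVC as a stand-in for the SVM classifier; pages and labels are placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# pages: HTML-stripped webpage texts; labels: their genre names (placeholders here).
pages = ["...text of page 1...", "...text of page 2...", "..."]
labels = ["blog", "eshop", "..."]

model = make_pipeline(
    # Binary character 3- to 5-gram features; set binary=False for the TF variant.
    CountVectorizer(analyzer="char", ngram_range=(3, 5), binary=True,
                    max_features=10000),
    LinearSVC(),
)

scores = cross_val_score(model, pages, labels,
                         cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0))
print("10-fold accuracy: %.3f" % scores.mean())
```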


4.3 Incorporating Structural Information

So far, only textual information was used. In the next experiment, the structural information extracted as described in Section 3.2 was incorporated into the character n-gram model (binary features). The evaluation results based on (microaverage) classification accuracy of stratified 10-fold cross-validation for 7genre and KI-04 are presented in Figures 3 and 4, respectively. In addition to the performance of the combined model of textual and structural information, the performance of the textual-only model (i.e., binary character n-grams) as well as that of the word-based model (binary features) is also depicted for comparative purposes. Clearly, the structural information helps the model achieve even higher performance. The most notable increase in performance with respect to the textual model is for the relatively low-dimensional feature sets (
