Finding and Classifying Web Units in Web Sites

Int. J. of Business Intelligence and Data Mining, Vol. x, No. x, xxxx


Finding and Classifying Web Units in Web Sites Aixin Sun School of Computer Science and Engineering University of New South Wales Sydney, NSW, Australia 2052 E-mail: [email protected]

Ee-Peng Lim School of Computer Engineering Nanyang Technological University Nanyang Avenue, Singapore 639798 E-mail: [email protected]

Abstract: In Web classification, most researchers assume that the objects to be classified are individual Web pages from one or more Web sites. In practice, this assumption is too restrictive, since a Web page by itself may not carry sufficient information to be treated as an instance of some semantic class or concept. In this paper, we relax this assumption and allow a subgraph of Web pages to represent an instance of a semantic concept. Such a subgraph of Web pages is known as a Web unit. To construct and classify Web units, we formulate the Web unit mining problem and propose an iterative Web unit mining (iWUM) method. The iWUM method first finds subgraphs of Web pages using knowledge about Web site structure and connectivity among the Web pages. From these Web subgraphs, Web units are constructed and classified into categories in an iterative manner. Our experiments using the WebKB dataset showed that iWUM was able to construct and classify Web units with high accuracy for the more structured parts of a Web site.

Keywords: Web unit, Web unit mining, Web classification

1 Introduction

1.1 Motivation

Copyright © 200x Inderscience Enterprises Ltd.

The World Wide Web (or Web) is populated with an enormous amount of information. To support better access to information on the Web, one can attempt to classify Web pages and Web sites so that they can be grouped into different pre-defined categories convenient for searching and browsing. This general task of classifying Web information is known as Web classification.

In Web classification research, most efforts are about classifying individual Web pages from one or more Web sites into a set of flat categories [4, 8, 28] or categories organized in a hierarchy [10]. In this line of research, Web pages are assumed to be independent of each other and each Web page is assigned its own category label. In other words, each Web page corresponds to an instance of a category. In reality, however, Web pages from a Web site are often created together with links among them, and these links carry semantics that bind a set of Web pages into a meaningful information unit [5, 3]. For example, one may have a set of Web pages that represent the presentation slides for a research seminar. These Web pages together form a complete presentation instance. Another example is a set of Web pages created by a graduate student, which includes a home page and other pages about his/her hobbies, research, and publications. In these two examples, a concept instance is described by a subgraph of Web pages. It is hard for readers to get an overall idea of each concept instance without browsing all, or at least most, of the member pages. Unfortunately, this observation has rarely been considered in existing Web classification research.

In this research, we argue that a set of Web pages that jointly represent a concept instance should be treated as an information unit and be classified together. Such a set of Web pages is known as a Web unit. Unfortunately, the Web pages that constitute a Web unit are not known in advance. We therefore formulate the problem of Web unit mining: determining the Web pages from a Web site that constitute each Web unit and classifying these Web units into a given set of categories.
Since Web units widely exist in Web sites, we believe that they are at the appropriate granularity level for Web site browsing and searching, as opposed to browsing and searching individual Web pages. Web unit mining is different from Web page classification as it consists of both Web unit construction and classification tasks. Web unit mining is therefore more appropriate for Web sites where a reasonably large number of concept instances exist. Examples of such Web sites are company Web sites hosting staff Web pages, school and university Web sites publishing course, instructor and research information, and Web sites of social groups and small/medium enterprises. These are Web sites that give people much autonomy in creating and maintaining their Web pages, or that do not carry so much data that backend databases are required. The Web pages of these Web sites are also assumed to be manually created and maintained by a community of users.

In this paper, we propose the iterative Web unit mining (iWUM) method, which allows Web unit construction and Web unit classification to be carried out iteratively. To evaluate the iWUM method and compare it with the baseline method, we have developed new precision and recall measures for Web unit mining results. We have also conducted experiments using the WebKB dataset.

1.2 Contributions

We summarize our contributions to solving the Web unit mining problem as follows:

• Definitions of Web unit and Web unit mining. The notion of Web unit was first defined in our research. We also define the Web unit mining problem, which has not been studied in previous Web classification research.

• Iterative Web unit mining (iWUM) method. The iWUM method allows Web unit construction and Web unit classification to be carried out iteratively. Each iteration selects fragments of Web units (also known as Web fragments) to be combined into more complete Web units and re-assigns category labels to all Web units based mainly on the Web site structural information. This process terminates when there are no further changes to the constructed Web units and their labels, or when the changes are negligible. The iWUM method performs very well on the WebKB dataset.

• Web unit mining performance measures. Unlike text classification or Web page classification, where the objects to be classified are clearly defined, Web units are constructed during the Web unit mining process. To evaluate a Web unit mining method, we have developed new precision and recall measures. The proposed measures evaluate a Web unit mining method on two criteria: (i) how well the Web units are constructed and (ii) how well the Web units are classified.

1.3 Paper Outline

The rest of the paper is organized as follows. In Section 2, we formally define the research problem. Related work is surveyed in Section 3. In Section 4, a Web directory is proposed to describe the organization of Web pages at Web servers, and observations on Web units are discussed. The iterative Web unit mining method is proposed and discussed in detail in Section 5, followed by Web unit mining evaluation in Section 6. Our experimental results are presented in Section 7. Finally, we conclude this paper in Section 8.

2 Problem Definition

In this section, we first give two Web unit examples, followed by the formal definition of Web unit. We then define the Web unit mining problem and discuss why this problem is hard.

In Section 1, a Web unit is informally defined to be a set of Web pages from the same Web site that jointly present information for one category instance. That is, a Web unit is meaningful only when categories are given beforehand. Consider a classification task to classify Web pages from a university Web site into the course and faculty categories. Two example Web units are illustrated in Figure 1: a course Web unit and a faculty Web unit. The CS100 course Web unit (see Figure 1(a)) consists of 8 pages; the first page is the course's key page (underlined) and the others provide supplementary course information. We call the latter support pages. Similarly, the first of the 7 pages in the Johnson faculty Web unit (see Figure 1(b)) is the key page and the remaining pages provide information about research, publication, teaching and so on. Every page in a Web unit contributes a piece of information. Since a Web unit has richer and more complete content than the individual Web pages, we argue that the Web unit is a more appropriate granularity for classifying, indexing and organizing Web information.

Figure 1: Web unit examples: CS100 and Johnson

(a) CS100 Web unit
http://...path/course/CS100/CS100.html [key page]
http://...path/course/CS100/lecture-programs.html
http://...path/course/CS100/instructors.html
http://...path/course/CS100/officehours.html
http://...path/course/CS100/exams/final.html
http://...path/course/CS100/exams/preliminary.html
http://...path/course/CS100/programs/program1.html
http://...path/course/CS100/programs/program2.html

(b) Johnson Web unit
http://...path/user/Johnson/index.html [key page]
http://...path/user/Johnson/research.html
http://...path/user/Johnson/publications.html
http://...path/user/Johnson/activities.html
http://...path/user/Johnson/students.html
http://...path/user/Johnson/teaching.html
http://...path/user/Johnson/contact.html

Definition 1 (Web Unit). Given a category $c$ and a Web site $W$, a Web unit $u_i$ of the category is a Web page or a set of Web pages from $W$ that jointly provide information about an instance of $c$. A Web unit consists of exactly one key page and zero or more support pages.

The key page of a Web unit is the page in the Web unit that is most often used as the link target when users refer to the information captured by the Web unit. The key page of a Web unit carries a meaning similar to that of the home page of a Web site. It is common for Web page creators to provide a key page (index page or home page) so that users can easily navigate a collection of Web pages, e.g., a Web unit. The key page and the set of support pages are denoted by $u_i.k$ and $u_i.s$ respectively. We call a Web unit a one-page Web unit if it consists of the key page only, and a multi-page Web unit otherwise. A link connecting any two pages of a multi-page Web unit is known as an intra-unit link. The support pages are often reachable from the key page through intra-unit links, so the Web pages in a Web unit usually form a connected subgraph. In some cases, however, the Web pages in a Web unit may not be connected. For example, there could be some homework assignment pages referenced directly by assignment handouts while these assignment pages are not linked from the course key page.

Based on the Web unit definition, we define the Web unit mining problem.

Definition 2 (Web Unit Mining). Given a collection of Web pages from a Web site $W$ and a set of categories $C$, Web unit mining is to construct Web units from these Web pages and assign them the appropriate category labels.
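As an illustration of Definition 1, the following minimal Python sketch models a Web unit as one key page plus a set of support pages and extracts its intra-unit links. The class name `WebUnit`, the helper `intra_unit_links`, and the choice to ignore self-links are assumptions made for this sketch only; they are not part of the paper's method.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Set, Tuple


@dataclass
class WebUnit:
    """Hypothetical container for a Web unit u_i (Definition 1)."""
    key_page: str                                          # u_i.k: exactly one key page
    support_pages: Set[str] = field(default_factory=set)   # u_i.s: zero or more pages
    category: Optional[str] = None                         # label assigned by mining

    def pages(self) -> Set[str]:
        """All member pages: the key page plus the support pages."""
        return {self.key_page} | self.support_pages

    def is_one_page(self) -> bool:
        """True for a one-page Web unit, False for a multi-page Web unit."""
        return not self.support_pages


def intra_unit_links(unit: WebUnit,
                     links: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep the (source, target) links whose two ends are distinct members of the unit.

    Ignoring self-links is an assumption of this sketch.
    """
    members = unit.pages()
    return [(s, t) for (s, t) in links if s != t and s in members and t in members]
```

A unit built with `WebUnit("k.html", {"a.html"})` is a multi-page Web unit, and a link from `a.html` to a page outside the unit is not an intra-unit link.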
Web unit mining therefore involves two main tasks: finding the set of Web pages that form each Web unit, and classifying the constructed Web units. The problem is challenging for several reasons:

• The criteria for selecting pages for a Web unit can be subjective. For example, depending on the user-perceived semantics of a course category, lecture note pages may or may not be included as part of a course Web unit. In this research, we do not assume that the exact semantics of a category are given beforehand. Instead, we rely on link connectivity and Web site structure information to guess the semantic boundaries of Web units.

• Web pages in a Web unit may not necessarily be connected. For example, there could be some homework assignment pages referenced directly by assignment handouts. This makes it difficult to determine the boundary of a Web unit.

• One has to determine the role of each Web page, i.e., key page or support page, within a Web unit. In our research, we introduce some heuristics to distinguish them.

To perform Web unit mining automatically using machine learning techniques, we assume that a set of perfectly labelled Web units is provided for training purposes. The training Web units need not come from the Web sites where the Web units are to be mined.

3 Related work

The home page finding task studied in TREC-2001 [13] is closely related to our task of finding the key pages in Web units. Word features from Web pages and words appearing in the in-link anchors are not very useful in home page finding [7, 13]. On the other hand, features such as URL-type have proved useful in this task [27, 16]. Four URL-types are defined: root, sub-root, path and file. As reported in [16], the Web pages having URLs of the first three types constitute less than 8% of the WT10g dataset but contribute more than 94% of the home pages. It is also mentioned that Web pages with file names containing 'welcome' and 'home' are likely to be home pages. These observations help us derive heuristics to find the key pages of Web fragments.

A major task in Web unit mining is Web unit classification. In our literature survey, we have noted that in existing Web classification research, the objects to classify are either individual Web pages [4, 8, 24] or entire Web sites [20, 11]. Our Web unit classification approach utilizes techniques from Web page classification. Most Web page classification research efforts assume that the text components of Web pages provide the primary information while the other non-text components can be used to further improve the classification accuracy [4, 8, 19, 28]. Web page classification using links between Web pages, one kind of non-text component, has been proposed in [2, 4, 12, 15, 19, 29]. In our earlier work, we have also shown that using the text body, title and words associated with the in-link anchors as Web page features could yield promising Web page classification results [24]. In both Web page and Web site classification, the objects (Web pages or Web sites) to be classified are given. This assumption does not hold for Web unit mining, as Web units have to be constructed and classified within the same method.

4 Web Directory Structure and Observations on Web units

The Web unit mining problem requires Web units to be automatically identified from a given Web site. As the Web pages from a Web unit jointly present information and complement each other, we believe that the way Web pages are arranged within the Web site, together with the connectivity among Web pages, can suggest the existence of Web units and their locations.

The location of a Web page is defined by its URL. A Web directory representing the structure of a Web site can be derived from the URLs of Web pages from the Web site. From a Web page's URL of the format protocol_type://hostname[:port_number][/path][filename], we can determine a set of Web folders from the hostname component using "." as the delimiter and from the path component using "/" as the delimiter. Example Web folders derived from the path component and the hostname component are shown in Figures 2(a) and 2(b) respectively. Note that the Web folders derived from the hostname component are arranged in child-parent relationships as shown in Figure 2(b).

Figure 2: Web folder derivation

(a) Web folders derived from path
URLs: http:// … path/course/CS100/CS100.html, http:// … path/course/CS100/exams/final.html, http:// … path/course/CS100/programs/program1.html
Folders: course > CS100 > {exams, programs}

(b) Web folders derived from hostname
URL: http://www.cais.ntu.edu.sg/home/index.jsp
Folders: sg > edu > ntu > cais > www > home

Given a set of Web pages, a Web directory is therefore a tree consisting of Web folders and Web pages as nodes, with the parent-child relationships among them in the URLs as edges. Note that a Web directory is built purely from the URLs of the pages from the Web site; the links among the Web pages are not considered.

With the notion of Web directory, we can state some common observations about the way Web units are usually organized within a Web directory. These observations provide guidelines for Web unit mining, for example in grouping pages together to construct Web units, and in making use of the Web directory information to derive features for Web unit classification.

Observation 1 Web pages from the same Web folder are more semantically related than Web pages from different Web folders.

We believe that the organization of Web pages in Web folders is similar to the organization of files on personal computers, and that a folder is usually created to store semantically related Web pages. This observation suggests that Web folders can be used as potential boundaries of Web units.
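The Web folder derivation described in this section can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `web_folders` is hypothetical, the hostname labels are reversed so that the top-level domain becomes the outermost folder (matching the order in Figure 2(b)), and the last path segment is treated as the file name unless the URL ends with "/".

```python
from urllib.parse import urlsplit


def web_folders(url):
    """Return (folders, filename) for a URL; a sketch, not the authors' implementation.

    folders lists Web folders from the outermost to the innermost one: hostname
    labels split on "." (reversed), followed by path segments split on "/".
    """
    parts = urlsplit(url)
    # Hostname component: "www.cais.ntu.edu.sg" -> ["sg", "edu", "ntu", "cais", "www"]
    host_folders = list(reversed(parts.hostname.split(".")))
    # Path component: "/home/index.jsp" -> ["home", "index.jsp"]
    segments = [s for s in parts.path.split("/") if s]
    filename = None
    # Assumption: a path that does not end with "/" ends with a file name.
    if segments and not parts.path.endswith("/"):
        filename = segments.pop()
    return host_folders + segments, filename


folders, filename = web_folders("http://www.cais.ntu.edu.sg/home/index.jsp")
print(folders, filename)
```

Inserting each page under its folder list, one folder per tree level, would yield the Web directory tree; link information is not needed at this stage.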
Observation 2 Support pages of a Web unit are usually reachable from the key page through intra-unit hyperlinks.

As support pages provide complementary information to the key page, they can often be reached from the key page through some intra-unit links. In most cases, they are directly linked. For example, it is common practice to include links from a graduate student's home page to his/her research project and publication pages. This suggests that intra-Web-unit connectivity should be strong.

Observation 3 The key page of a Web unit is usually found at the highest-level Web folder compared to the Web folders of other pages in the Web unit.

This observation is similar to the analysis by Kraaij, who concluded that the home page of a Web site is normally at the root level [16]. For example, a person's home page, index.html, is usually located directly under the Web folder assigned to him/her. The other Web pages belonging to the person's Web unit may be found in some sub-folders. The two Web units shown in Figure 1 illustrate this observation.

Observation 4 Two Web units of the same category seldom have direct links between them, provided that the category has no recursive relationship with itself.

This observation applies to cases where there are a large number of Web units belonging to the same category. For example, it is unusual for a faculty member's Web page to have a link to another faculty member's Web page when they are not related in research or teaching. Nevertheless, faculty members may be indirectly related by having links to the university or department home page.

Observation 5 Multi-page Web units of the same category often reside in a set of folders, one for each Web unit, and the folders are directly under a common parent folder.

For large Web sites, there are often folders created for their users to publish their Web pages. Examples are university and school Web sites where each person is assigned a folder under a common parent folder with a name such as "users", "home" or "~".

Observation 6 The key pages of Web units of the same category are often the link targets of a hub page, which may be found at: (a) the folder where the Web units are located if the latter are one-page Web units; (b) the parent (or ancestor) folder of the folder(s) where the Web units are located if the latter are multi-page Web units.
A hub page is often created as an index page in a large Web site to help users find information easily. For example, a university Web site often has a page listing all faculty members or postgraduate students of a department. A product catalog is another example.

Note that the above observations are not necessarily exhaustive, and a careful study is required to investigate their validity. Nevertheless, we believe that they can be useful heuristics for identifying Web units. In our experiments, we use simple statistical methods to analyze whether these observations hold for the Web sites used in the experiments.

5 Iterative Web Unit Mining

In this section, we first propose a straightforward solution to Web unit mining and discuss its shortcomings. To address these shortcomings, the iterative Web unit mining (iWUM) method is proposed to construct and classify Web units in an iterative manner. Each step in iWUM is discussed in detail.


The two challenges in Web unit mining are to identify the key pages and the support pages of Web units, and to classify the Web units. A straightforward solution to Web unit mining, referred to as the baseline method in this paper, is to perform Web page classification on all Web pages and treat the Web pages assigned category labels as key pages and the rest as non-key pages. Web units are then constructed by merging the non-key pages with the key pages based on some heuristics, e.g., Web directory and link connectivity. For example, non-key pages that are link targets of a key page and reside in the same Web folder as the key page can be merged with the key page to form a Web unit.

There are two shortcomings with the baseline method. The first is that the set of Web pages to be mined may contain non-key pages; treating them the same as key pages during classification may lead to more misclassification. The second is that the baseline method simply ignores the context information (or structural information) of the Web site.

To address the first shortcoming, we group closely-related pages together to form Web fragments.

Definition 3 (Web Fragment). A Web fragment is a set of closely-related Web pages, usually from the same Web folder. A Web fragment can be treated as a potential Web unit or a part of a Web unit. Each Web fragment contains exactly one key page and zero or more support pages.

By grouping pages into Web fragments, all the pages in a Web fragment are classified together as one entity. Non-key pages can be embedded in Web fragments so that they do not need to be classified in isolation, thus reducing classification errors.

To address the second shortcoming, we propose a set of features to describe how the Web fragments (Web units) are organized in the Web site. These features are used to further improve the Web unit classification. Moreover, the re-classified Web units can again be used to construct larger Web units by repeating the construction-classification procedure. In such an iterative manner, better Web unit mining results can be achieved.

5.1 iWUM Overview

Our proposed iWUM method is illustrated in Figure 3 and presented in Algorithm 1. The algorithm draws ideas from both co-training and Expectation-Maximization (EM) algorithms [6, 17, 1]. There are two phases in iWUM, namely Web fragment generation and classification. The iterative steps in the classification phase are enclosed within a dotted box. iWUM also requires a collection of training Web units that are perfectly labelled. These Web units can come from one or more Web sites.

In the Web fragment generation phase, we take a collection of Web pages from one Web site and derive from them a Web directory representing the folder structure of the Web site. Once the Web directory is built, we compute the connectivity indices of the Web folders to locate the Web folders containing the key pages of Web units, and generate the Web fragments using these key pages. Web fragment generation is described in more detail in Section 5.2. As part of the Web fragment generation phase, we construct classifiers for classifying Web fragments using the labelled Web units. Web fragment classification is discussed in Section 5.3.

Figure 3: Iterative Web unit mining (iWUM). Inputs: labeled Web units; Web pages from one site. Web fragment generation phase (WebFragGen): train Web fragment classifier, build Web directory, generate Web fragments, classify Web fragments. Classification phase: construct Web units, train Web unit classifier, classify Web units. Output: Web units.

Algorithm 1 Iterative Web unit mining method
 1: train Web fragment classifier FC
 2: build Web site directory
 3: generate Web fragments
 4: for each Web fragment m do
 5:     FC.classify(m)
 6: end for
 7: repeat
 8:     construct Web units
 9:     collect Web unit features
10:     train Web unit classifier UC
11:     for each Web unit u do
12:         UC.classify(u)
13:     end for
14: until the change of Web units' category labels is not significant

In the classification phase, Web units are constructed from the labelled Web fragments based on some heuristic rules. Once these Web units are constructed, information on how the Web site organizes these Web units is collected. For example, are Web units of the same category located together under a common parent folder? Is there a hub page linking to all key pages of Web units of the same category? This information is used as features to train a Web unit classifier, which assigns category labels to all Web units. The Web unit construction and classification process repeats until the changes in the Web units' category labels are not significant, i.e., the change rate of the category labels is less than a predefined threshold. Web unit construction and classification are described in detail in Sections 5.4 and 5.5 respectively.
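The stopping rule of the repeat loop in Algorithm 1 can be sketched as follows. This is a hedged illustration only: `construct_units` and `classify_units` are hypothetical callables standing in for the steps described in Sections 5.4 and 5.5, labels are assumed to be keyed by key page URL, and the change rate is assumed to be the fraction of keys whose label differs between two consecutive iterations. The default threshold and iteration cap are arbitrary.

```python
def change_rate(old_labels, new_labels):
    """Fraction of items whose category label differs between two iterations."""
    keys = set(old_labels) | set(new_labels)
    if not keys:
        return 0.0
    changed = sum(1 for k in keys if old_labels.get(k) != new_labels.get(k))
    return changed / len(keys)


def iterate_until_stable(fragment_labels, construct_units, classify_units,
                         threshold=0.01, max_iter=20):
    """Repeat construct-then-classify until the label change rate drops below threshold.

    fragment_labels: dict mapping a key page URL to its initial label (or None).
    construct_units / classify_units: placeholders for the paper's heuristics
    and Web unit classifier; they are assumptions of this sketch.
    """
    labels = dict(fragment_labels)
    iteration = 0
    for iteration in range(1, max_iter + 1):
        units = construct_units(labels)       # line 8 of Algorithm 1
        new_labels = classify_units(units)    # lines 9-13 of Algorithm 1
        rate = change_rate(labels, new_labels)
        labels = new_labels
        if rate < threshold:                  # line 14: change is not significant
            break
    return labels, iteration
```

The iteration cap `max_iter` is a safeguard added for the sketch; the text only specifies the threshold on the change rate.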

Figure 4: Illustration of the symbols and their semantics in Web directory. The figure shows a Web directory rooted at Web folder F0 with Web folders F0 to F4 and Web pages p1 to p5, annotated with F0.t = { F0, F1, F2, F3, F4 }, F0.s = { F1, F2 } and F0.p = { p1, p2 }.

5.2 Web Fragment Generation

Web fragment generation is an important step in iWUM. During the Web fragment generation process, some of the support pages are identified and encapsulated in Web fragments so that these pages will not cause Web fragment classification errors. For example, a timetable page of a course Web unit could cause a Web fragment classification error if it is classified alone. However, if the timetable page is embedded in some Web fragment, such an error would not happen. There are two objectives in Web fragment generation:

• All the closely-related Web pages (most likely belonging to the same Web unit) should be grouped together as Web fragments.

• No key page of any Web unit should be grouped as a support page of any Web fragment.

The second objective can be achieved trivially by making each individual page a Web fragment, but such a solution clearly contradicts the first objective. Based on the earlier observations, we develop two important criteria for generating Web fragments: the connectivity indices of Web folders, and the names of Web pages and Web folders in the Web directory. The former allows us to determine clusters of relatively better connected Web pages, while the latter exploits the naming conventions of Web pages and Web folders to locate key pages [16].

Figure 4 illustrates the meaning of the symbols related to Web directories. Their semantics are also summarized in Table 1. We denote a page $p_i$ directly under folder $F_j$ by $p_i \in F_j.p$, or simply $p_i \in F_j$. Similarly, we write $p_i \in F_j.s$ and $p_i \in F_j.t$ when $p_i$ is under some folder in $F_j.s$ and $F_j.t$ respectively. Note that $F.p$, $F.s$, and $F.t$ are sets, and we use $|S|$ to denote the number of elements in a set $S$.

Given a set of Web pages from a Web site $W$, if $h(p_a, p_b)$ denotes the number of hyperlinks from page $p_a$ to $p_b$, the connectivity from $p_a$ to $p_b$, denoted by $c(p_a, p_b)$, is defined in Equation 1.

\[
c(p_a, p_b) = \frac{h(p_a, p_b)}{\sum_{p_i \in W} h(p_a, p_i)} \times \frac{h(p_a, p_b)}{\sum_{p_j \in W} h(p_j, p_b)} \tag{1}
\]
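Equation 1 multiplies the share of $p_a$'s out-links that go to $p_b$ by the share of $p_b$'s in-links that come from $p_a$. A minimal Python sketch is given below; the dictionary representation of $h$ and the convention of returning 0 when there is no link from $p_a$ to $p_b$ (which also avoids a zero denominator) are assumptions of the sketch.

```python
def connectivity(h, pa, pb):
    """Page-to-page connectivity c(pa, pb) following Equation 1.

    h: dict mapping (source, target) -> number of hyperlinks between pages
       of the same Web site W (missing pairs mean zero links).
    """
    links = h.get((pa, pb), 0)
    if links == 0:
        # Assumed convention: no link means zero connectivity.
        return 0.0
    out_total = sum(n for (s, _), n in h.items() if s == pa)  # sum_i h(pa, pi)
    in_total = sum(n for (_, t), n in h.items() if t == pb)   # sum_j h(pj, pb)
    return (links / out_total) * (links / in_total)


# Toy link counts (hypothetical pages a, b, c, d)
h = {("a", "b"): 2, ("a", "c"): 2, ("d", "b"): 2}
print(connectivity(h, "a", "b"))  # (2/4) * (2/4) = 0.25
```

The value reaches 1 only when all out-links of $p_a$ point to $p_b$ and all in-links of $p_b$ come from $p_a$.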

Table 1: Symbols and their semantics in Web directory

Symbol   Description
p        a Web page.
F        a Web folder in Web directory.
F.s      set of folders immediately under F, i.e., sub-folders of F.
F.p      set of pages immediately under F.
F.t      set of folders in the subtree rooted at F, including F itself, its sub-folders and descendant folders.

The connectivity from a page $p_a$ to a Web folder $F_b$ is the normalized connectivity from $p_a$ to all pages directly under $F_b$.

\[
c(p_a, F_b) = \frac{\sum_{p_i \in F_b} c(p_a, p_i)}{|F_b.p|} \tag{2}
\]

The connectivity from a folder $F_a$ to another folder $F_b$ is defined as follows.

\[
c(F_a, F_b) = \frac{\sum_{p_i \in F_a,\, p_j \in F_b} c(p_i, p_j)}{|F_a.p| \times |F_b.p|} \tag{3}
\]
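Equations 2 and 3 are averages of page-level connectivity values over the pages directly under the folders involved. The sketch below assumes the page-level values $c(p_i, p_j)$ are already available in a dictionary `c` (missing pairs count as 0) and that each folder is given as the list of pages directly under it; the function names and the value 0 for an empty folder are assumptions of the sketch.

```python
def page_to_folder(c, pa, folder_pages):
    """c(pa, Fb), Equation 2: mean connectivity from pa to the pages in Fb.p."""
    if not folder_pages:
        return 0.0  # assumed convention for a folder with no direct pages
    return sum(c.get((pa, pi), 0.0) for pi in folder_pages) / len(folder_pages)


def folder_to_folder(c, pages_a, pages_b):
    """c(Fa, Fb), Equation 3: mean connectivity over all page pairs (pi, pj)."""
    if not pages_a or not pages_b:
        return 0.0  # assumed convention for folders with no direct pages
    total = sum(c.get((pi, pj), 0.0) for pi in pages_a for pj in pages_b)
    return total / (len(pages_a) * len(pages_b))


# Toy page-level connectivity values (hypothetical pages)
c = {("a", "x"): 0.5, ("a", "y"): 0.25, ("b", "x"): 0.25}
print(page_to_folder(c, "a", ["x", "y"]))            # (0.5 + 0.25) / 2 = 0.375
print(folder_to_folder(c, ["a", "b"], ["x", "y"]))   # 1.0 / 4 = 0.25
```

Because every page-level value lies in [0, 1], both averages stay in [0, 1] as well.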

Note that $c(p_a, p_b)$, $c(p_a, F_b)$ and $c(F_a, F_b)$ are all in the range $[0, 1]$. From the above connectivity definitions, we define the connectivity index of a given folder $F_i$, denoted by $\varphi_{F_i}$, as shown in Equation 4, where each item $e_a$, $e_b$ can be either a page or a sub-folder directly under $F_i$.

\[
\varphi_{F_i} = \frac{1 + \sum_{e_a, e_b \in F_i} \bigl( c(e_a, e_b) + c(e_b, e_a) \bigr)}{1 + |F_i.p| + |F_i.s|} \tag{4}
\]

Note that the connectivity index of a Web folder reflects the connectivity among the items (pages and sub-folders) that are directly under the folder. The smaller the $\varphi_{F_i}$ value, the less connected the pages and sub-folders under $F_i$ are, which suggests that $F_i$ is more likely a folder containing multiple Web units (if any exist), according to Observations 4 and 5. To determine which connectivity index values count as high, we introduce a threshold denoted by $\varphi_\theta$. If $\varphi_{F_i}$ is greater than $\varphi_\theta$, our iWUM method tries to find a Web page within $F_i$ and use it as the key page of a new Web fragment. This new Web fragment is formed by merging the key page with other pages directly under $F_i$. However, if $\varphi_{F_i}$ is not greater than $\varphi_\theta$, the Web pages and sub-folders in $F_i$ are weakly connected and each page in $F_i.p$ forms one Web fragment.

Besides the connectivity index, the iWUM method also finds (Web page, Web folder) pairs such that the Web page and the Web folder are directly under a common parent folder and share a common name. These are known as page-folder-pairs.

Armed with the connectivity index values and page-folder-pairs, the iWUM method first sorts the Web folders by their $\varphi$ values in increasing order, and invokes the Web fragment generation function shown in Algorithm 3. These invocations are controlled by a Web directory traversal function shown in Algorithm 2.

Algorithm 2 Web directory traversal
 1: get Web folder list FL from the input Web directory
 2: sort FL according to ϕ value in increasing order
 3: while |FL| > 0 do
 4:     Fi = FL.pop_front()
 5:     if no Web folder in Fi.t is visited then
 6:         WebFragGen(Fi)
 7:     else if all folders in Fi.s have been visited then
 8:         WebFragGen(Fi.p)
 9:         mark Fi as visited
10:     else
11:         FL.push_back(Fi)
12:     end if
13: end while

Algorithm 3 WebFragGen
Input: Web folder Fi
Output: Web fragments
 1: mark Fi as visited
 2: if ϕFi ≤ ϕθ then
 3:     for each sub-folder Fj ∈ Fi.s do
 4:         WebFragGen(Fj)
 5:     end for
 6:     for each page pk ∈ Fi.p do
 7:         create one-page Web fragment with pk as key page
 8:     end for
 9: else
10:     for each page-folder-pair (pr, Fs) such that pr ∈ Fi.p and Fs ∈ Fi.s do
11:         create Web fragment with pr as key page and pages in Fs.t (subtree) as support pages
12:         mark pr and Fs as visited
13:     end for
14:     for each unvisited sub-folder Fj ∈ Fi.s do
15:         WebFragGen(Fj)
16:     end for
17:     if there exists an unvisited candidate key page pk ∈ Fi.p then
18:         create Web fragment with pk as key page and reachable unvisited pages ∈ Fi.p as support pages
19:         mark pk and reachable pages as visited
20:     end if
21:     for each unvisited page pn ∈ Fi.p do
22:         create one-page Web fragment with pn as key page
23:     end for
24: end if

Algorithm 3 takes a Web folder as input and outputs a set of Web fragments. In Line 10, page-folder-pairs are directly used to create Web fragments. For example, a page named sc101.html and a sub-folder named sc101 form a page-folder-pair under the course Web folder. The algorithm generates a Web fragment using sc101.html as the key page and the pages in sc101 as support pages. Lines 17 to 20 implement a heuristic rule to find a candidate key page among the unvisited pages in $F_i$. The heuristic rule examines the URL-type and name of each Web page [16]. A Web page $p$ may be a candidate key page of a Web fragment if:

• The URL of $p$ ends with a "/" (i.e., the URL-type is root, sub-root or path);

• $p$ and the folder containing it share the same name;

• The name of $p$ matches any of the following: home, index, welcome, default, and homepage.

In Line 18, the reachable pages refer to the pages that have not been visited and can be reached from the key page $p_k$ through hyperlinks among the unvisited pages under folder $F_i$.

Algorithm 2 starts with the Web folder with the smallest $\varphi$ value and invokes the Web fragment generation function in a bottom-up manner. The bottom-up traversal ensures that Web fragments at a high-level folder are generated only after the Web fragments at the sub-folders have been generated. In Line 8 of Algorithm 2, WebFragGen(Fi.p) generates Web fragments only from the pages directly under $F_i$, using Lines 17 to 23 of Algorithm 3.

Applying both Algorithms 2 and 3, 6 Web fragments will be generated from the examples given in Figure 1. Figure 5 shows the 6 Web fragments, where the underlined URLs are the key pages. Web fragment g1 is generated because the three non-key pages are all reachable from CS100.html through intra-fragment links; g6 is generated similarly. Since no candidate key page is found under the exams and programs folders, one Web fragment (i.e., g2 to g5) is created from each page under the two folders.

5.3 Web Fragment Classification

The purpose of Web fragment classification is to assign a category label (and classification scores) to each Web fragment. The classification of Web fragments is based on their key pages. In this work, we adopted our earlier Web page classification method to classify Web fragment key pages [24]. Each key page of a fragment is represented by a binary feature vector obtained from the words in the text body, title and in-link anchors. The same word appearing in the text, title and/or anchors is treated as different features. All HTML tags are discarded beforehand and words are stemmed using Porter's algorithm [22].

In our experiments, SVMlight (a Support Vector Machine implementation in C) is used to construct the Web fragment classifiers [14]. Since an SVM is a binary classifier, one Web fragment classifier needs to be constructed for each category and trained with both positive and negative examples. The positive training examples for a category cj consist of the key pages of the Web units labelled with category cj; the negative training examples include the support pages of the Web units in cj and both the key pages and support pages of the Web units that do not belong to cj. After Web fragment classification, some of the Web fragments are labelled with categories while others are not.
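The key page representation described above can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in the experiments: the function names and the `field:word` feature naming are hypothetical, tokenization is a simple assumed rule, and Porter stemming is omitted to keep the example self-contained.

```python
import re


def field_features(text, title, anchors):
    """Return the set of active binary features for one key page.

    The same word yields a different feature for each field it occurs in
    (text body, title, in-link anchors). Stemming is omitted in this sketch.
    """
    features = set()
    fields = (("text", text), ("title", title), ("anchor", " ".join(anchors)))
    for field, content in fields:
        for word in re.findall(r"[a-z0-9]+", content.lower()):
            features.add(f"{field}:{word}")
    return features


def to_binary_vector(features, vocabulary):
    """Map a feature set onto a fixed vocabulary as a 0/1 vector."""
    return [1 if name in features else 0 for name in vocabulary]
```

For instance, a page whose body, title and in-link anchor all contain the word "course" activates three distinct features: `text:course`, `title:course` and `anchor:course`.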


Web fragment id   Web pages
g1                http://...path/course/CS100/CS100.html
                  http://...path/course/CS100/lecture-programs.html
                  http://...path/course/CS100/instructors.html
                  http://...path/course/CS100/officehours.html
g2                http://...path/course/CS100/exams/final.html
g3                http://...path/course/CS100/exams/prelim.html
g4                http://...path/course/CS100/programs/program1.html
g5                http://...path/course/CS100/programs/program2.html
g6                http://...path/user/Johnson/index.html
                  http://...path/user/Johnson/research.html
                  http://...path/user/Johnson/publications.html
                  http://...path/user/Johnson/activities.html
                  http://...path/user/Johnson/students.html
                  http://...path/user/Johnson/teaching.html
                  http://...path/user/Johnson/contact.html

Figure 5 The Web fragments generated from the CS100 course pages and Johnson homepages

Figure 6 Web unit merge process (diagram not reproduced: an original Web unit plus a Web unit to be merged gives the merged Web unit; the legend distinguishes key pages from support pages)

5.4 Web Unit Construction

The iWUM method constructs Web units from Web fragments in its first iteration, or from the Web units returned by the previous iteration in later iterations. The main task in Web unit construction is to merge a set of unlabelled Web units with some labelled Web unit to form a “larger” Web unit, so that the resultant Web unit contains richer and more complete information. Suppose that, in Figure 5, g1 is assigned the course label and g2 to g5 are not assigned any label. Then g2 to g5 are merged with g1 to form a larger course Web unit. The merge process is shown in Figure 6.

The Web unit construction follows two heuristic rules. These rules are applied to the Web folders in a Web directory in a top-down manner, starting from the root.
• If there is one and only one Web unit ui immediately under a Web folder Fi, ui is labelled, and all the other Web units under any folder of the subtree Fi.t are not labelled, then the pages from these unlabelled Web units are merged with ui as support pages. This is illustrated by F1 and u1 in Figure 7. This rule is guided by Observation 3.
• If there is more than one Web unit immediately under a Web folder Fi and only one of them, say ui, is labelled (see F2 and u2 in Figure 7), then the pages from the unlabelled Web units immediately under Fi are merged with ui as support pages. If all the Web units in the sub-folders and descendant folders of Fi are also unlabelled, the pages from these Web units are merged with ui as support pages as well.

Figure 7 Web unit construction heuristics (diagram not reproduced: Web folders Fr, F1 and F2 with Web units u1, u1′, u2 and u2′; the legend distinguishes Web folders, labelled Web units and unlabelled Web units)

5.5 Web Unit Classification

The objective of Web unit classification is to improve Web unit mining accuracy by considering the organization of Web units within the Web site and the word features in the Web page names and URLs (see Observations 5 and 6). By considering the organization of Web units in a Web site, a Web unit can be described by two separate sets of features: content features and Web site structure features. The content features refer to the words in the key page, the title and the anchor words (the same features as those used in the Web fragment key page classification). Web site structure features refer to those obtained by studying how the Web units are organized in a Web site. The Web site structure features used in our experiments are also listed in this section.

To fully utilize both the content features and the Web site structure features, we applied the co-training algorithm in Web unit mining [1, 18, 6]. Two assumptions are made in the co-training algorithm:
• Each instance to be classified can be described by two sets of features, and each set of features is sufficient for correct classification.
• The two sets of features are conditionally independent.

Based on the observations discussed in Section 4, it is reasonable to assume that the Web site structure features are sufficient for Web unit classification. One can also assume that the content features and Web site structure features are conditionally independent. These two assumptions make it possible to adopt the co-training algorithm in Web unit classification.

There are two sub-steps in co-training. The first sub-step is to construct the Web units from Web fragments, which are classified based on content features only. In the second sub-step, the Web site structure features are derived for constructing the Web unit classifiers. The Web unit classification process is illustrated in Figure 8, where the iterative portion is enclosed in a dotted box.

Figure 8 Web unit classification process (diagram not reproduced; components shown: Web fragments g1, ..., gn; content features; Web fragment classifier; Web unit labels; Web site structure features; Web unit classifier; Web units u1, ..., un)

The iteration is guided by the Expectation-Maximization (EM) algorithm [9]. EM is an iterative algorithm for maximum likelihood estimation in parametric estimation problems with missing data [17]. In iWUM, the category labels of the Web units are considered as the missing data. In the M-step, the Web site structure features are obtained from the weakly labelled Web units. Using these Web units and the Web site structure features, a new set of Web unit classifiers is constructed. In the E-step, all Web units are classified using the Web unit classifiers and the category labels are updated according to the combined classification score.

Naturally, a condition to stop the iteration process is when there are no further changes to the Web units' labels. However, this may incur much computation. Hence, we require iWUM to stop when the category label changes are not significant. Suppose that, among m Web units, n changed their category labels in an iteration. The category label change rate, denoted by δ, is defined as δ = n/m. In our experiments, the threshold on δ is 0.05, meaning that iWUM stops when fewer than 5% of the Web units change their labels. The relationship between the change rate δ and the number of iterations is further studied in Section 7.

In our experiments, the following Web site structure features are derived for a Web unit ui.
• Combined classification score for each category. Both the Web fragment key page classifiers and the Web unit classifiers are SVM classifiers, whose outputs are in the range (−∞, +∞). We normalize each output to the range [0, 1] using a logistic link function [26, 21]. Let fs(ui|cj) be the actual score returned by the SVM classifier for a category cj; the normalized classification score of ui for cj is $f_n(u_i|c_j) = 1/(1 + e^{-f_s(u_i|c_j)})$. Let fg(ui|cj) denote the normalized Web fragment key page classification score and fu(ui|cj) the normalized Web unit classification score of ui for category cj. The combined classification score is $s(u_i|c_j) = \sqrt{f_g(u_i|c_j) \times f_u(u_i|c_j)}$. Note that before the first iteration in Web unit classification, fu(ui|cj) is not available and s(ui|cj) = fg(ui|cj).
• Closeness to the average depth for each category. The depth of ui, denoted by di, is defined as the depth of the Web folder (in the Web directory) containing ui's key page. Let the maximum depth of a Web directory be dm and the average depth of the Web units assigned category label cj be d̄j. The closeness to the average depth of ui for category cj is $dis(d|c_j) = 1.0 - |d_i - \bar{d}_j|/d_m$.
• Highest in-link hub value for each category. If there exists a link from page pa to pb, pa is considered a hub page of pb. Suppose there are m Web units assigned category label cj, and n of them are link targets of page pk, i.e., pk has links to the key pages of these n Web units; the hub value of pk for cj is hub(pk|cj) = n/m. If Pi is the set of pages having links to the key page of a Web unit ui, the highest in-link hub value of ui for cj is $h_m(c_j) = \max_{p_k \in P_i} hub(p_k|c_j)$. This feature is defined based on Observation 6.
• Precision support of parent Web folder for each category. The precision support of a folder Fi for a category cj is the percentage of Web units assigned the cj label among all the ones directly under Fi. If there is only one Web unit under a folder, the precision support is either 1 or 0, depending on whether the category given matches the Web unit's category label. This situation often happens for multi-page Web units according to Observation 5, since each multi-page Web unit resides in one Web folder (under a common Web folder). We therefore consider the precision support of the Web folder at the parent level of the Web folder containing the Web unit. Let Fi be the parent folder of the folder containing ui. Suppose there are m Web units at the grandchild level of Fi, and n of them are under category cj; the precision support of Fi is Pr(Fi|cj) = n/m. This value is used by all the Web units at the grandchild level of Fi that have been assigned the cj label.
• Recall support of parent Web folder for each category. Similarly, let Fi be the parent folder of the folder containing ui. Among all m Web units assigned cj, if n of them are at the grandchild level of Fi, the recall support is Re(Fi|cj) = n/m. The precision and recall supports of the parent Web folder are defined based on Observation 5.
• Words in page names and URLs. The page names refer to both the key page file name and the file names of the support pages in a Web unit. Words in a Web unit's URL refer to the words in the URL of its key page (excluding the file name). To distinguish these three sets of features, they are weighted differently. Words in the name of ui's key page are assigned a weight of 1.0. Words in the URL of ui's key page are equally weighted with a sum of 1.0. Words in the names of ui's support pages are equally weighted with a sum of 1.0.

Note that all the features discussed above have values in the range [0, 1]. If a word appears more than once and has a combined weight of more than 1.0, its weight is fixed at 1.0. Among all these features, only the word features are not category-specific; the other features are category-specific.

To construct the Web unit classifier for each category, we first split the Web units into two subsets by a classification score threshold (0.5 in our experiments). The top 80% ranked Web units from the upper subset and the bottom 80% ranked Web units from the other subset are used as positive and negative training examples respectively. The Web site structure features are then derived from these training Web units. The features for training a Web unit classifier of a category cj include the combined classification scores of all the categories, the word features, and the other features associated with cj only. Suppose there are |C| categories. The feature vector length for a Web unit is |C| × (combined classification score) + 1 × (closeness) + 1 × (hub value) + 1 × (precision support) + 1 × (recall support) + multiple (word feature values in names and URLs).

Table 2 Contingency table for Web unit ui

                           Perfect Web unit u′i
Constructed Web unit ui    u′i.k    u′i.s    NU
ui.k                       TKi      SKi      -
ui.s                       KSi      TSi      FSi
NU                         NKi      NSi      -
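Several of the category-specific Web site structure features of Section 5.5 reduce to a few lines of arithmetic. The Python sketch below is illustrative only: the function and argument names are hypothetical, the combined score is assumed to be the geometric mean of the two normalized scores, and an empty category or an empty set of in-link pages is assumed to give a hub value of 0.

```python
import math


def normalize_svm_score(raw_score):
    """Logistic link: map an SVM output in (-inf, +inf) to [0, 1]."""
    if raw_score >= 0:
        return 1.0 / (1.0 + math.exp(-raw_score))
    z = math.exp(raw_score)  # avoids overflow for very negative scores
    return z / (1.0 + z)


def combined_score(f_g, f_u=None):
    """Combine fragment and unit scores; before the first iteration only f_g exists."""
    if f_u is None:
        return f_g
    return math.sqrt(f_g * f_u)  # assumed geometric-mean form


def depth_closeness(depth, mean_depth, max_depth):
    """Closeness of a unit's folder depth to the category's average depth."""
    return 1.0 - abs(depth - mean_depth) / max_depth


def hub_value(linked_pages, category_key_pages):
    """n/m: share of the category's m key pages that one page links to."""
    if not category_key_pages:
        return 0.0
    return len(set(linked_pages) & set(category_key_pages)) / len(category_key_pages)


def highest_inlink_hub(in_link_pages, out_links, category_key_pages):
    """Largest hub value among the pages linking to a unit's key page."""
    return max(
        (hub_value(out_links.get(p, ()), category_key_pages) for p in in_link_pages),
        default=0.0,
    )
```

All four functions return values in [0, 1] for valid inputs, which matches the range stated for the structure features.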

6 Web Unit Mining Evaluation

We evaluate a Web unit mining method by the number of correctly classified Web units. While this sounds similar to conventional Web page or text classification, there are two distinct differences. Firstly, Web units are sets of pages that are discovered in the process of Web unit mining, so it is difficult and unfair to expect perfect matching between the labelled Web units and the mined Web units. Secondly, the notions of key and support pages also complicate the performance measurement. Hence, the standard precision and recall measures in text classification cannot be applied directly. In this paper, we therefore propose new precision and recall definitions for Web unit mining.

To account for the importance of key pages, we introduce a weight factor α to represent the degree of importance when the key page of a Web unit is correctly identified. Given a Web unit u, the value of α is in the range [1/|u|, 1], where |u| is the number of Web pages in Web unit u. If α = 1, the importance of the key page completely dominates that of the support pages. The Web unit mining performance is then determined only by the ability to identify and classify key pages correctly. If α = 1/|u|, every key page or support page of a Web unit enjoys equal importance. By choosing an α value, we can assign appropriate importance to key and support pages in a Web unit mining performance metric.

Given a Web unit ui constructed by a Web unit mining method, we must first match it with an appropriate labelled Web unit u′i, also known as the perfect Web unit. We define u′i to be the labelled Web unit that contains ui.k and has the same label as ui; ui.k can be either the key page or a support page of u′i.

The contingency table for matching a Web unit ui with its perfect Web unit u′i is shown in Table 2. Each table entry represents the overlapping Web pages between the key/support pages of ui and u′i. For example, TKi = {ui.k} ∩ {u′i.k} and TSi = ui.s ∩ u′i.s. The entries in the last column and row account for pages that appear either in ui or u′i, but not both: FSi = ui.s − (u′i.s ∪ {u′i.k}), NKi = {u′i.k} − {ui.k} − ui.s, and NSi = u′i.s − {ui.k} − ui.s. Note that |TKi| + |KSi| + |NKi| = 1 and |TKi| + |SKi| = 1. If the perfect Web unit for ui does not exist, ui is considered invalid and is assigned zero precision and recall values. Otherwise, the precision and recall of a Web unit ui are defined as follows.

\[ Pr_{u_i} = \frac{\alpha \cdot |TK_i| + (1 - \alpha) \cdot |TS_i|}{\alpha + (1 - \alpha) \cdot (|KS_i| + |TS_i| + |FS_i|)} \tag{5} \]

\[ Re_{u_i} = \frac{\alpha \cdot |TK_i| + (1 - \alpha) \cdot |TS_i|}{\alpha + (1 - \alpha) \cdot (|SK_i| + |TS_i| + |NS_i|)} \tag{6} \]

Suppose M Web units are constructed and assigned category cj, and N Web units are manually labelled with cj. The precision and recall of category cj, denoted by $Pr_{c_j}$ and $Re_{c_j}$, are defined as follows.

\[ Pr_{c_j} = \frac{\sum_{u_i \in c_j} Pr_{u_i}}{M} \tag{7} \]

\[ Re_{c_j} = \frac{\sum_{u_i \in c_j} Re_{u_i}}{N} \tag{8} \]

From the above definitions, we can also easily derive the macro/micro-averaged precision and recall for a set of categories. Note that, if α = 1, a Web unit is evaluated purely based on the key page, i.e., $Pr_{u_i} \in \{0, 1\}$ and $Re_{u_i} \in \{0, 1\}$, and $Pr_{c_j}$ and $Re_{c_j}$ are equivalent to the precision and recall definitions commonly used in IR. Therefore, the performance of Web unit mining (α = 1.0) can also be compared with the performance of Web page classification on the same dataset (see Section 7.2).
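Equations (5) to (8) can be checked on small examples with the following Python sketch. It is a minimal illustration, assuming that a Web unit is given as a key page plus a set of support pages and that the perfect Web unit has already been matched; the function names are hypothetical.

```python
def unit_precision_recall(unit_key, unit_support, perfect_key, perfect_support, alpha):
    """Precision and recall of one constructed Web unit, following Eqs. (5)-(6)."""
    unit_support = set(unit_support)
    perfect_support = set(perfect_support)
    perfect_pages = perfect_support | {perfect_key}
    if unit_key not in perfect_pages:
        # No valid perfect Web unit: the constructed unit scores zero.
        return 0.0, 0.0
    tk = 1 if unit_key == perfect_key else 0        # |TK_i|
    sk = 1 if unit_key in perfect_support else 0    # |SK_i|
    ks = 1 if perfect_key in unit_support else 0    # |KS_i|
    ts = len(unit_support & perfect_support)        # |TS_i|
    fs = len(unit_support - perfect_pages)          # |FS_i|
    ns = len(perfect_support - {unit_key} - unit_support)  # |NS_i|
    numerator = alpha * tk + (1 - alpha) * ts
    precision = numerator / (alpha + (1 - alpha) * (ks + ts + fs))
    recall = numerator / (alpha + (1 - alpha) * (sk + ts + ns))
    return precision, recall


def category_precision_recall(unit_scores, n_labelled):
    """Category-level precision and recall, following Eqs. (7)-(8).

    `unit_scores` holds one (precision, recall) pair per constructed unit
    assigned to the category (M pairs); `n_labelled` is N.
    """
    m = len(unit_scores)
    precision = sum(p for p, _ in unit_scores) / m if m else 0.0
    recall = sum(r for _, r in unit_scores) / n_labelled if n_labelled else 0.0
    return precision, recall
```

As a usage example, a constructed unit with key page a and support pages {b, c, x}, matched against a perfect unit with key page a and support pages {b, c, d}, gives |TKi| = 1, |TSi| = 2, |FSi| = 1 and |NSi| = 1; with α = 0.5 both precision and recall equal 1.5/2.0 = 0.75.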

7 Experiments and Results

7.1 Dataset and Experimental Setup

There are no existing labelled Web unit datasets for our experiments. We have therefore chosen to label Web units in the WebKB dataset, which has been commonly used in Web page classification research. The 4159 Web pages collected from four universities were manually classified into 7 categories: student, faculty, staff, department, course, project and other. The other category is a special category for those pages that were not assigned as the “main pages” in the first six categories.

We manually grouped the pages in WebKB into Web units and labelled them. This was done by using the pages with non-other labels as key pages of Web units and most pages with the other category label as support pages. We call this dataset UnitSet. Some instances have multiple “key pages”; for example, CS631/home.html and CS631/Welcome.html were both labelled as course pages in the original WebKB dataset. Since the two course pages actually refer to the same course instance and a Web unit can have only one key page, the first page is used as the key page and the other page as a support page. There are some pages from the other category that cannot be labelled as support pages of any Web unit, and we excluded them from UnitSet. Only Web units labelled with the student, faculty, course and project categories were used in the experiments, as the numbers of Web units for the other categories are small (they are often less than 20 in UnitSet). The dataset statistics are shown in Table 3, where u and p refer to the number of Web units and pages respectively.

Table 3 Web unit distribution in UnitSet

              student     course      faculty     project     department  staff
University    u     p     u     p     u     p     u     p     u     p     u     p
Cornell       128   301   42    219   34    60    20    78    1     6     21    57
Texas         148   370   38    95    46    104   20    115   1     14    3     10
Washington    126   495   74    360   31    71    21    129   1     1     10    23
Wisconsin     156   416   82    413   42    83    25    90    1     8     12    27

To derive the precision and recall values, we use leave-one-university-out cross-validation in our experiments. That is, in each run we use Web units from three universities as training data and the pages from the fourth university as test data. Note that UnitSet is used to train our Web fragment classifiers and to provide the perfect Web units for evaluating the Web unit mining methods. The test pages (from the test university) are, however, taken from the original WebKB dataset, and they include pages that may not be included in UnitSet.

In our experiments, Web fragment classifiers and Web unit classifiers are all constructed based on SVMlight [14]. The performance of the Web unit mining methods is measured by precision and recall, denoted by Pr and Re respectively. The macro and micro averages of precision/recall are denoted by Pr^M/Re^M and Pr^µ/Re^µ respectively. The F1 measure computed from Pr and Re is also reported. More details about these measures can be found in [23, 25].

7.2 Web Unit Mining Results

First, we would like to measure the performance of our proposed iWUM method and compare it with two baseline methods, namely the pure baseline and the baseline with fragments methods. Recall that the pure baseline method (see Section 5) classifies Web pages and merges them to construct Web units. Compared to iWUM, the baseline method involves only three steps: train Web page classifiers, classify Web pages, and construct Web units. The Web unit construction algorithm is the same as that of iWUM except that only single-page Web fragments are used to construct Web units. Different from the baseline method, the baseline with fragments method (denoted by baseline(fragment)) deals with generated Web fragments instead of single-page Web fragments only. It consists of five steps: train Web fragment classifiers, build Web directory, generate Web fragments, classify Web fragments and construct Web units. It does not include the Web unit classification step and the iterative nature of iWUM. The purpose of including baseline(fragment) is to investigate whether the iterative Web unit construction and classification actually improve Web unit mining results. For the baseline(fragment) and iWUM methods, we set ϕθ to 0.1667. This is equivalent to at least 5 items under a folder carrying no links among them.

Overview of performance results

The performance of the three methods is measured using different α values, i.e., 1.0, 0.5 and 1/|ui| (see Section 6). The results for these α values are reported in Tables 4, 5 and 6 respectively. The macro/micro averaged measures of the three methods using different α values are shown in Figure 9. For iWUM, all the results are obtained by setting the category label change rate δ to 0.05 (see Section 5.5).

Table 4 Web unit mining results (α = 1.0)

            baseline               baseline(fragment)     iWUM
Category    Pr     Re     F1       Pr     Re     F1       Pr     Re     F1
project     0.378  0.119  0.173    0.394  0.119  0.175    0.350  0.195  0.250
student     0.808  0.659  0.721    0.856  0.659  0.741    0.908  0.885  0.895
faculty     0.825  0.413  0.542    0.865  0.396  0.533    0.919  0.752  0.820
course      0.702  0.603  0.643    0.778  0.549  0.641    0.761  0.674  0.713
MacroAve    0.678  0.449  0.520    0.723  0.431  0.523    0.735  0.627  0.669
MicroAve    0.771  0.562  0.648    0.828  0.548  0.658    0.872  0.762  0.813

Table 5 Web unit mining results (α = 0.5)

            baseline               baseline(fragment)     iWUM
Category    Pr     Re     F1       Pr     Re     F1       Pr     Re     F1
project     0.378  0.113  0.166    0.394  0.113  0.168    0.334  0.188  0.240
student     0.807  0.624  0.700    0.847  0.644  0.728    0.895  0.870  0.881
faculty     0.774  0.392  0.515    0.763  0.393  0.512    0.856  0.734  0.784
course      0.717  0.509  0.591    0.785  0.526  0.625    0.759  0.648  0.698
MacroAve    0.669  0.409  0.493    0.697  0.419  0.508    0.711  0.610  0.651
MicroAve    0.770  0.518  0.618    0.815  0.534  0.644    0.854  0.746  0.796

Table 6 Web unit mining results (α = 1/|ui|)

            baseline               baseline(fragment)     iWUM
Category    Pr     Re     F1       Pr     Re     F1       Pr     Re     F1
project     0.378  0.111  0.165    0.394  0.111  0.167    0.332  0.187  0.239
student     0.812  0.619  0.699    0.852  0.642  0.729    0.902  0.868  0.883
faculty     0.838  0.388  0.523    0.860  0.393  0.529    0.908  0.733  0.802
course      0.724  0.498  0.586    0.793  0.524  0.625    0.772  0.645  0.701
MacroAve    0.688  0.404  0.493    0.725  0.418  0.512    0.729  0.608  0.656
MicroAve    0.781  0.512  0.618    0.828  0.532  0.646    0.868  0.744  0.801

Figure 9 Macro/Micro-averaged results of the three methods (plot not reproduced: MaPr, MaRe, MaF1, MiPr, MiRe and MiF1 for baseline, baseline(fragment) and iWUM, each with α = 1.0 and α = 1/|u|)

When α = 1.0, Web units are evaluated purely based on key pages. Since support pages are not considered, the evaluation is similar to Web page classification. Compared to our earlier Web page classification research [24], the overall performance of the baseline method measured by F1 is slightly poorer. The reason is that the SCut thresholding strategy [28] used in our earlier work resulted in balanced precision and recall values. As we do not apply SCut in these experiments, we obtain precision values significantly higher than recall.

The effect of adding Web fragment generation to the baseline method is shown under the baseline(fragment) column in Table 4. The baseline(fragment) method yields a noticeable improvement in precision for all the categories but slight degradations in recall for faculty and course. This could be explained by the number of key page misses in Web fragment generation, as shown in Table 7. A key page is considered a miss when it is included only as a support page in a Web fragment. On the whole, by comparing the results of baseline and baseline(fragment), we conclude that Web fragment generation contributes positively to both the macro and micro averaged F1 measures. The higher precision achieved by Web fragment generation is important to the iWUM method, as higher precision ensures that better quality training Web units can be used in the iterative Web unit classification process.

Table 7 Web Fragment Generation for iWUM and baseline(fragment)

University    Pages   Fragments   Key-page-miss
Cornell       858     622         6
Texas         817     519         8
Washington    1197    821         1
Wisconsin     1249    736         20

The iWUM method delivers a significant improvement in recall values for all categories compared to the baseline and baseline(fragment) methods (see Table 4). Improvements in precision are achieved for student and faculty, which are the two largest categories among the four categories. The improvement in the F1 measure of faculty is also significant. Nevertheless, poorer precision values are reported for project and course. The results suggest that university Web sites normally have a better structure for key pages of students or faculty members but not for project and course pages. However, iWUM's results for project are not representative, as there were only around 20 project instances in each test run. Hence, in the iterative Web unit classification process, there were fewer than 20 positive training project Web units. According to [10], at least 30 positive/negative training examples are required for an SVM classifier to deliver generalized classification performance. The lack of positive training examples may jeopardize the SVM classifier's performance. Incidentally, we found that the precision and recall were zero for project when the Texas and Washington datasets were used as test datasets.

The performance results with α = 0.5 for the three methods are reported in Table 5. Each Web unit is evaluated with the key page having a weight of 0.5 and a total weight of 0.5 for all the support pages. Note that there are slight degradations in most precision and recall values compared to those with α = 1.0. This indicates that our methods do not detect support pages perfectly. Some noise pages could have been included in the derived Web units while some support pages might have been missed. Nevertheless, there is a slight improvement in the precision for course using the baseline and baseline(fragment) methods. On average a course Web unit has more than 5 pages, the largest among all the four categories (refer to Table 3). The correctly identified support pages in this case may help to improve the precision. The comparison between the results with α = 1.0 and α = 0.5 reveals that the heuristic rules in Web unit construction work well, as there is no significant degradation in either precision or recall. A similar observation holds when comparing the results for α = 1.0 and α = 1/|ui|, as shown in Table 6. Overall, the macro/micro averaged F1 value is highest when α = 1.0, followed by α = 1/|ui| and then α = 0.5.

Detailed Performance of iWUM

When the Web pages from each university are used as the test dataset, the category label change rate (δ) after each iteration of the Web unit classification process (see Section 5.5) is shown in Figure 10. The iWUM method (iterative Web unit classification part) stops after 6 or 7 iterations. Surprisingly, all four university datasets behave quite similarly. Most of the Web unit category label changes took place in the first two iterations. From the third iteration onwards, fewer than 5% of the Web units changed their category labels and there was no significant change in the classification results.

Figure 10 Category label change rate after each iteration (plot not reproduced: concept label change rate against the number of iterations, 1 to 7, for Cornell, Texas, Washington and Wisconsin)
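The category label change rate δ = n/m and the stopping rule of Section 5.5 can be sketched in Python as follows. This is an illustrative sketch only: `relabel` is a hypothetical callable standing in for one full iteration of iWUM (deriving structure features, retraining the Web unit classifiers and reclassifying all Web units), and `max_iterations` is an assumed safety cap that is not part of the method as described.

```python
def label_change_rate(previous_labels, current_labels):
    """delta = n / m: fraction of the m Web units whose label changed."""
    if not current_labels:
        return 0.0
    changed = sum(
        1
        for unit, label in current_labels.items()
        if previous_labels.get(unit) != label
    )
    return changed / len(current_labels)


def iterate_until_stable(initial_labels, relabel, threshold=0.05, max_iterations=20):
    """Repeat relabelling until fewer than `threshold` of the units change label.

    Returns the final labels and the change rate observed after each iteration.
    """
    labels = dict(initial_labels)
    history = []
    for _ in range(max_iterations):
        new_labels = relabel(labels)
        delta = label_change_rate(labels, new_labels)
        history.append(delta)
        labels = new_labels
        if delta < threshold:
            break
    return labels, history
```

The returned `history` corresponds to one curve of the kind plotted against the number of iterations: one change rate per iteration for a given test university.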
Considering the computational cost for each iteration, we argue that δ = 0.05 is a good condition to stop the iteration.

The precision/recall values of the iWUM method for student, faculty and course at each iteration are plotted in Figure 11. As the smallest category, project is not shown because the results are not very representative. Relatively larger changes in precision/recall values occur at the first and second iterations. After that, the measures are more or less stable. This is expected because fewer category label changes for Web units take place during the last few iterations, as shown in Figure 10.

Figure 11 Precision/recall values for student, faculty and course (ϕθ = 0.1667, α = 1.0) (plots not reproduced: one panel per university, Cornell, Texas, Washington and Wisconsin, showing Pr and Re of student, faculty and course against the number of iterations)

For all four university test datasets, the recall of all three categories improved substantially. For Cornell, both precision and recall improved after the first iteration for all three categories. For Texas, a lower precision for course is observed. The same phenomenon is also observed for Wisconsin. For Washington, there is a slight decrease in the precision of student. We have checked UnitSet and noticed that the course instances of both Texas and Wisconsin are not very well structured. These instances are normally located in the lecturers' home pages. However, in Cornell and Washington, most of the course Web units are organized under some common Web folders. It is therefore fair to comment that the features used in Web unit classification are more applicable to structured Web sites. The existence of folders having higher precision/recall supports and the existence of hub pages for certain categories could be indications of structured Web sites. The improvement of the Web unit mining performance for less-structured Web sites will be an interesting topic for our future study.

Effect of ϕθ in Web fragment generation and Web unit mining

To evaluate the effect of the connectivity index threshold ϕθ, the iWUM method is executed with different ϕθ values. These values are selected based on several equivalent cases. Consider a Web folder Fi containing n Web pages and 0 sub-folders. Suppose there is no connectivity between any two of these n Web pages, i.e., c(pj, pk) = 0 where pj ∈ Fi.p, pk ∈ Fi.p. According to Equation 4, the connectivity index of Fi is ϕFi = 1/(n + 1). By setting n from 1 to 9, a set of ϕθ values is obtained in the range of 0.1 to 0.5, as shown in Figures 12 and 13. For example, n = 5 gives ϕθ = 1/6 ≈ 0.167 and n = 6 gives ϕθ = 1/7 ≈ 0.143.

Figure 12 Number of Web fragments and number of key-page-miss against ϕθ (plots not reproduced: number of Web fragments generated and number of key pages missed against the connectivity index threshold, 0.1 to 0.5, for Cornell, Texas, Washington and Wisconsin)

Figure 13 Macro/Micro measures against change of ϕθ (α = 1.0) (plots not reproduced: macro and micro Pr, Re and F1 against the connectivity index threshold, 0.1 to 0.5)

The number of Web fragments generated and the number of key-page-misses are shown in Figure 12. The higher the ϕθ value, the more Web fragments are generated. The reason is that each Web page under a folder is treated as a Web fragment if the folder's connectivity index is no greater than ϕθ (Line 2 of Algorithm 3). However, a lower ϕθ value results in a large number of key-page-misses (see Texas and Washington), since many weakly-connected Web fragments are generated. Some of the key pages are included in these weakly-connected Web fragments as support pages.

The macro/micro measures of the iWUM method (α = 1) against various ϕθ values are shown in Figure 13. Compared with precision and F1, recall values are stable when ϕθ ≥ 0.167. This shows that the large number of key-page-misses affects the recall value when ϕθ ≤ 0.143. The precision values start to drop when ϕθ > 0.167 because more Web fragments are generated when a higher ϕθ is chosen. For instance, when ϕθ = 0.5, the number of Web fragments generated is close to the number of Web pages from each Web site. In summary, a ϕθ value in the range of 0.16 to 0.2 is suggested.

Informal Verification of the Observations

The observations listed in Section 4 are used as guidelines in our proposed iWUM method. To understand the validity of these observations and the extent to which they can be useful to iWUM, we performed some statistical analysis on the four university datasets.


Table 8 Statistics on semantic relations of Web pages under Web folders

University   Web Folder   One Unit   Category Unit   None Unit
Cornell      142          0.838      0.021           0.092
Texas        125          0.880      0.008           0.048
Washington   190          0.905      0.011           0.058
Wisconsin    220          0.773      0.027           0.073

Observation 1 Two Web pages are semantically related if they satisfy either of the following cases: (1) they belong to the same Web unit, or (2) they belong to different Web units from the same category. In a Web site, some Web folders contain sub-folder(s) only and some contain only one Web page. We therefore evaluate every Web folder containing at least two Web pages. In Table 8, we show for each university dataset the number of Web folders containing at least two Web pages each (Web folder); the proportion of these Web folders under which all Web pages belong to one Web unit (one unit); the proportion of the Web folders under which Web pages belong to multiple Web units from the same category (category unit); and the proportion of Web folders under which no Web page belongs to any Web unit (none unit). It shows that between 77% and 91% of the Web folders contain Web pages that all come from the same Web unit. These statistics strongly support our Algorithm 3, where Web folders are used as Web fragment boundaries.

Observation 2 We evaluate this observation by checking all the Web units from each of the four university Web sites. We define a Web unit as connected if all its support pages are reachable from the key page through intra-unit links, and as unconnected otherwise. A one-page Web unit is always connected. We found that 95.9% of the Web units from Cornell, 94.5% from Texas, 89.7% from Washington and 92.5% from Wisconsin are connected. In summary, this observation generally holds in the four university Web sites.

Observation 3 This observation was verified during our manual labelling of the UnitSet dataset. Nevertheless, the manual labelling process is subjective since it requires human judgement. This observation is also partially supported by Observation 5.

Observation 4 For each university Web site, among the Web units belonging to each category, the proportion of Web unit pairs having direct links between their key pages is shown in Table 9.
It shows that 1.4% to 3.2% of project Web unit pairs have links between their key pages, against at most 1.5% of student, faculty, course and staff Web unit pairs. This observation holds in all four university datasets.

Table 9 Proportion of key-page-connected Web unit pairs from the same category

University   Project   Student   Faculty   Course   Staff
Cornell      0.018     0.001     0         0.003    0.005
Texas        0.032     0.001     0         0        0
Washington   0.014     0.002     0.006     0.003    0
Wisconsin    0.020     0.001     0.002     0.004    0.015

Observation 5 The statistics on multi-page units are reported in Table 10. The percentage column reports, for each category, the proportion of multi-page units among all (multi-page and one-page) Web units. The recall column reports the proportion of multi-page units located in the Web folder (Web folder column) that hosts most of these Web units. Generally speaking, other than the faculty Web units, about half or more of the Web units of each category are multi-page units. About 80% to 100% of student, staff and faculty Web units are located in the same parent folder, i.e., people in Cornell, users in Texas, homes in Washington and ∼ in Wisconsin. The course Web units are not always located in one Web folder, although some Web folders were created for them, such as info/courses/current in Cornell and education/courses in Washington. This remark also applies to the project Web units. In summary, Observation 5 holds in the four Web sites.

Table 10 Statistics on multi-page Web units

University   Category   Percentage   Recall   Web folder
Cornell      project    0.600        0.417    /info/projects
Cornell      student    0.563        0.986    /info/people
Cornell      faculty    0.324        1.000    /info/people
Cornell      course     0.714        0.333    /info/courses/current
Cornell      staff      0.667        0.785    /info/people
Texas        project    0.800        0.625    /users/
Texas        student    0.541        1.000    /users/
Texas        faculty    0.239        1.000    /users/
Texas        course     0.474        0.278    /users/
Texas        staff      0.667        1.000    /users/
Washington   project    0.857        0.167    /research/
Washington   student    0.746        1.000    /homes/
Washington   faculty    0.323        1.000    /homes/
Washington   course     0.622        0.609    /education/courses/
Washington   staff      0.700        1.000    /homes/
Wisconsin    project    0.640        0.437    /
Wisconsin    student    0.608        0.989    ∼
Wisconsin    faculty    0.309        0.923    ∼
Wisconsin    course     0.597        0.183    ∼
Wisconsin    staff      0.583        1.000    ∼
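The recall figure used for Observation 5 can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each multi-page Web unit is identified by the URL path of its key page and that the hosting folder is the parent of the folder containing that key page; the helper names `hosting_folder` and `folder_recall` and the sample paths are hypothetical and are not part of iWUM.

```python
from collections import Counter
import posixpath


def hosting_folder(key_page_path: str) -> str:
    """Return the parent of the folder that contains the key page.

    Assumed layout: /info/people/jdoe/index.html lives in the unit folder
    /info/people/jdoe, whose parent /info/people is the hosting folder.
    """
    unit_folder = posixpath.dirname(key_page_path)
    return posixpath.dirname(unit_folder) or "/"


def folder_recall(key_page_paths):
    """Return (folder hosting most units, share of units located in it)."""
    if not key_page_paths:
        raise ValueError("at least one Web unit is required")
    counts = Counter(hosting_folder(p) for p in key_page_paths)
    folder, hits = counts.most_common(1)[0]
    return folder, hits / len(key_page_paths)


# Hypothetical key pages of four multi-page units in one category.
units = [
    "/info/people/a/index.html",
    "/info/people/b/home.html",
    "/info/people/c/index.html",
    "/info/projects/x/index.html",
]
folder, recall = folder_recall(units)
```

Under these assumptions, three of the four sample units share the hosting folder `/info/people`, which is the kind of concentration that the recall column measures.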

Observation 6 We computed the hub value hub(pk|cj) of each page pk for category cj using the formula in Section 5.5. The pages (URLs) with the highest hub values for each category are listed in Table 11. As shown in Table 11, some hub pages point to 80% to 90% of the Web units from the respective categories, for example, students.html for the student category in Cornell (hub value 0.984). Observation 6 holds in all four Web sites.
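As a rough illustration of how such hub pages could be ranked, the Python sketch below scores each page by the fraction of a category's Web units whose key pages it links to. This scoring rule is an assumption made for the example; the actual hub formula is the one defined in Section 5.5 and may differ. The function names and the sample link data are hypothetical.

```python
def hub_values(out_links, key_pages_by_category):
    """Score every (page, category) pair.

    out_links: dict mapping a page URL to the set of URLs it links to.
    key_pages_by_category: dict mapping a category to the set of key-page
    URLs of its Web units.
    Assumed score: share of the category's key pages linked from the page.
    """
    hubs = {}
    for category, key_pages in key_pages_by_category.items():
        if not key_pages:
            continue  # no Web units in this category, nothing to score
        for page, targets in out_links.items():
            hubs[(page, category)] = len(targets & key_pages) / len(key_pages)
    return hubs


def top_hub(hubs, category):
    """Return (page, score) with the highest score for the category."""
    candidates = [(p, v) for (p, c), v in hubs.items() if c == category]
    return max(candidates, key=lambda item: item[1])


# Hypothetical link structure: one listing page and the site home page.
links = {
    "/people/students.html": {"/~a/", "/~b/", "/~c/"},
    "/index.html": {"/~a/"},
}
key_pages = {"student": {"/~a/", "/~b/", "/~c/", "/~d/"}}
best_page, best_score = top_hub(hub_values(links, key_pages), "student")
```

In this toy data the listing page links to three of the four student key pages, so it is ranked first for the student category.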


Table 11 Pages with highest hub values for each category

University   Category   Hub value   URL
Cornell      project    0.850       /info/projects.html
Cornell      student    0.984       /info/people/students.html
Cornell      faculty    0.912       /info/faculty/faculty-list.html
Cornell      course     0.286       /info/courses/Spring-96/courses.html
Cornell      staff      0.905       /info/people/researchers.html
Texas        project    0.950       /docs/research.html
Texas        student    0.993       /docs/grad.html
Texas        faculty    0.674       /docs/prof.html
Texas        course     0.684       /docs/classes.html
Texas        staff      0.667       /users/jbc/home/facilities.html
Washington   project    0.143       /homes/bershad/
Washington   student    0.873       /people/grads/
Washington   faculty    0.903       /people/faculty/
Washington   course     0.676       /education/course-Webs.html
Washington   staff      0.300       /research/projects/spin, /www/external/members.htm
Wisconsin    project    0.520       /rsch-info/
Wisconsin    student    0.974       /directories/gradlist.html
Wisconsin    faculty    0.143       /∼kristint/dbmshome.html
Wisconsin    course     0.354       /directories/classes.html
Wisconsin    staff      0.250       /∼kristint/dbmshome.html

8 Conclusions

In this paper, we proposed the notion of Web unit and studied the problem of Web unit mining. We proposed an iterative Web unit mining method (iWUM) that involves Web fragment generation, Web fragment classification, iterative Web unit construction and Web unit classification. The method was evaluated against two baseline methods using a specially crafted dataset derived from WebKB. We also proposed appropriate measures for evaluating Web unit mining methods. Our experiments showed that the iWUM method works well and is particularly effective for well-structured Web sites.

We believe that the iWUM method can be further improved, particularly in Web fragment generation and classification and in Web unit construction and classification. In Web fragment generation, we plan to propose an algorithm that automatically determines the Web folder connectivity index threshold (ϕθ). In Web fragment classification, how to classify Web fragments as sets of Web pages (instead of classifying only their key pages) remains to be investigated. In the iterative Web unit construction and classification process, the selection of the training (weakly labelled) Web units and the feature selection on the Web site structure features need further study. In addition, much more research should be conducted on Web unit indexing and searching, i.e., using Web units instead of Web pages for organizing information. We also intend to extend our experiments to much larger datasets.
