2004 ACM Symposium on Applied Computing

Framework for Mining Web Content Outliers

Malik Agyemang, Ken Barker, Reda Alhajj
Department of Computer Science, University of Calgary
2500 University Drive N.W., Calgary, Alberta, Canada
{agyemang, barker, alhajj}@cpsc.ucalgary.ca

ABSTRACT
Outliers are data objects with different characteristics compared to other data objects. Exploring the diverse and dynamic web data for outliers is more interesting than finding outliers in numeric data sets. Interestingly, the existing web mining algorithms have concentrated on finding frequent patterns while discarding the less frequent ones, which are likely to contain the outlying data. This paper refers to outliers present on the web as web outliers to distinguish them from traditional outliers. Web outliers are data objects that show significantly different characteristics than other web data. Although the presence of web outliers appears obvious, there is neither a formal definition of web outliers nor algorithms for mining them. Moreover, traditional outlier mining algorithms designed solely for numeric data sets are inappropriate for mining web outliers. This paper establishes the presence of web outliers and discusses some practical applications of web outlier mining. Finally, we present a taxonomy for web outliers and propose a general framework for mining web content outliers.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications – Data Mining

General Terms: Algorithms, Design

Keywords: Data mining, web outliers, content-specific algorithms, resource extraction.

1. INTRODUCTION
The exponential growth of the web makes it a popular and fertile place for research. The huge, diverse, dynamic, and unstructured nature of the web calls for automated tools for tracking and analyzing web data and their usage patterns. This has given rise to the development of many server-side and client-side intelligent systems for mining information on the web.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SAC ’04, March 14-17, 2004, Nicosia, Cyprus. Copyright 2004 ACM 1-58113-812-1/03/04 ...$5.00.

Web mining has been described as the discovery and analysis of interesting and useful patterns from the web [6, 14]. However, existing web mining algorithms deal with finding frequent patterns while eliminating less frequent ones, usually described as nuisance, noise, or outliers. Outliers are observations that deviate so much from other observations as to arouse suspicion that they were generated by a different mechanism [8], or data objects that are inconsistent with the rest of the data objects [2]. Outliers identified in web data are referred to as web outliers to distinguish them from traditional outliers, and the mining process for web outliers is called web outlier mining. The following statements help to motivate work in this area:
• 80% of people who visited company A's website also visited at least one other department in the company.
• 70% of all pages linked together contain related information.
These are frequent patterns established using association rule mining. However, it would also be interesting to know the sites visited by the remaining 20% of visitors who did not visit any other department, and the 30% of unrelated pages that are linked together. None of the existing web mining algorithms can provide this information because such patterns would have been eliminated as either less frequent or outliers. Identifying such rare patterns can be interesting and can lead to the discovery of competitors in e-business, criminal activities in electronic commerce, and the most frequently accessed and used pages in web sites.

This paper broadly describes web outlier mining as the discovery and analysis of rare and interesting patterns from the web. The differences in the information contents of web pages and servers make web outlier mining more challenging than traditional outlier mining. Unlike traditional outlier mining algorithms designed solely for numeric data sets, web outlier mining algorithms should be applicable to data of varying types, including text, hypertext, video, etc. It is therefore impossible to design a single algorithm for mining web outliers. Web outlier mining is thus categorized into three components, depending on the source and data types involved in the mining process: web content outlier mining, web structure outlier mining, and web usage outlier mining. This categorization enhances the development of content-specific algorithms for mining web outliers.

Web usage outlier mining is dedicated to finding outliers in web usage data. Web usage data, often referred to as web clickstream data, consist of user activities and interactions with the web. The usage data are often captured in servers as server logs, referrer logs, browser logs, etc. The contents of a typical web server log may


include IP addresses, page addresses and references, time-in, and the time spent on each page. In web structure outlier mining, we concentrate on discovering rare and interesting patterns in the hyperlink structure of the web. It aims at finding links whose connecting nodes contain completely unrelated information. Mining web structure outliers may enhance the quality of hubs and authority sites for a given topic. Hubs are pages having quality links to pages addressing a topic of interest, whereas authorities are pages linked to by many hubs addressing that topic. Web content outlier mining concentrates on finding outliers in the contents of a web page (web document). The contents of a page consist of data of different types, including text, images, video, hyperlinks, etc. Web pages whose contents differ from those of the category from which they were taken constitute web content outliers.

1.1 Motivation
Web outlier mining has a number of practical applications in electronic commerce. Mining web outliers may lead to the discovery of competitors in electronic business and commerce. Consider the statement: "60% of users who visited the site www.chapters.ca purchased book(s)." The alternative sites visited by the 40% of Chapters' visitors who did not purchase any book can be established using web usage outlier mining. If it is established that some fraction of the 40% bought book(s) from the alternative sites, then those sites are competitors to Chapters. Electronic business competitors can be established using web usage outlier mining irrespective of domain. In addition, web usage outlier mining algorithms may be useful in determining pages that are frequently accessed and used.

Similarly, web content outlier mining can be used to determine pages with entirely different contents from their parent web sites, and hence to identify emerging business potential. For example, a mortgage loan facility found on the website of an insurance company constitutes a web content outlier: the contents of the pages dedicated to marketing and promoting the loan facility differ significantly from the other pages on the website. Such a loan facility can be identified using web content outlier mining algorithms.

Finally, web structure outlier mining may be very useful in improving the quality of hubs and authority sites for a given topic (e.g., SARS). This is achieved by removing unrelated but linked pages using web structure mining algorithms.

It is obvious that web outlier mining has several practical applications. Unfortunately, there is neither a formal definition of web outliers nor algorithms for mining them. It is against this background that we establish the presence of outliers on the web and provide a general framework for mining web content outliers.

1.2 Contributions
We introduce the web outlier concept and propose a general framework for mining web outliers. We provide a taxonomy for web outliers and describe the different types of outliers present on the web. In addition, a general framework for mining web content outliers is provided.

1.3 Outline of Paper
Section 2 presents an overview of outlier mining. Section 3 presents the taxonomy of web outliers together with descriptions of its components. Section 4 presents a framework for mining web content outliers. Conclusions and future work are presented in Section 5.

2. RELATED WORK
Traditional outlier mining techniques include those from statistics and data mining. In statistics, data objects are fitted to standard distributions, and outliers are data objects found to differ from the remaining data [2, 8]. The statistical techniques require upfront knowledge of the data distribution and its parameters (e.g., mean and variance). Depth-based algorithms for mining outliers are discussed in [10]; their underlying principle is that, when data is organized into layers, shallow layers are more likely to contain outlying data than deep layers. The distance-based outlier concept assigns numeric distances to data objects and computes outliers as data objects with relatively larger distances [1, 11, 12, 13]. An outlier concept that assigns a degree of outlierness, called the local outlier factor, to every object is developed in [3, 9]. The local outlier factor depends on the remoteness of an object with respect to its neighborhood; outliers are objects that tend to have high local outlier factors.

In this paper, we establish the presence of outliers on the web and give some practical applications of web outliers. In addition, we propose a general framework for mining web outliers and provide specific algorithms for mining web content outliers and web usage statistical outliers.

3. TAXONOMY OF WEB OUTLIERS
The web consists of data from different sources made up of different data types. The different sources and data types pose a real challenge for the automatic discovery of web-based information. For example, algorithms designed for mining outliers from web usage data cannot be applied to web contents because of the differences in input data types. The taxonomy for web outliers shown in Figure 1 allows content-specific algorithms to be designed for mining outliers from specific sources.

Web Outliers
• Web Content Outliers
• Web Usage Outliers
  - Statistical Outliers
  - Pattern Outliers
• Web Structure Outliers

Figure 1: Taxonomy of web outliers

Thus, separate algorithms can be designed for mining outliers from web servers as well as from web pages. A detailed description of the components of the taxonomy is given in the subsequent sections, together with algorithms for mining web usage statistical outliers and web content outliers. Algorithms for mining web usage pattern outliers and web structure outliers are beyond the scope of this paper.


3.1 Web Usage Outliers
Web usage outliers are those present in web usage data. Web usage data consist of user interactions with the web, usually captured in web servers (e.g., server logs, browser logs, referrer logs). Web usage data contain irrelevant information that must be removed through preprocessing. Preprocessing is a very important task in all web mining algorithms; we preprocess the data by applying the algorithms and techniques presented in [5]. Web usage data may contain two types of outliers: web usage statistical outliers and web usage pattern outliers, each of which has its own unique importance.

Web usage statistical outliers exist in summary statistics of web usage data. Access frequencies (the number of accesses per page) generated from web usage data have been the most commonly used summary statistic. Pages with high access frequencies are declared very important, and their contents are usually moved to the main pages during review. We argue against the use of access frequencies alone in determining the importance of pages, since some pages may be frequently accessed but not used and as such are not important. The first argument is that some web designers use "catchy phrases" in the metadata descriptions of web pages that do not reflect their contents. Such pages are more likely to be returned by web search engines and may receive a high number of hits. Web users may visit such pages because they appear among their search results but may eventually not use them since their contents are different from what was expected. The visits add to the access frequency counts but do not reflect the importance of the page. The second argument is that access frequencies do not provide information on the relationships between pages; in other words, there is no information on how pages linked to each other are related.

The first problem is addressed by proposing a technique that uses both access frequencies and the time spent on a page. Combining time and frequency is a better indicator of importance than using frequency alone. Pages with high time-to-frequency ratios tend to be more outlying, and hence more important, than those with lower ratios. Figure 2 illustrates the access frequencies and time spent on a web site. Using access frequencies alone, page P5 is the most accessed and hence the most important page; however, it also has the smallest time-to-frequency ratio, which indicates that most users only accessed the page but did not use it. On the other hand, page P3 has the highest time-to-frequency ratio, which means visitors actually used the page when they visited it.

Page   Access Freq.   Time
P1     20             45
P2     15             35
P3     25             60
P4     30             60
P5     40             20

Figure 2: Access frequency and time pairs

In designing the model for finding web usage statistical outliers, we assume access frequencies are normally distributed. The test statistic $TF_i$ is defined as follows:

$$TF_i = \frac{T_i f_i^{-1} - \overline{Tf^{-1}}}{\sigma_{Tf^{-1}}}$$

where

$$\sigma_{Tf^{-1}} = \sqrt{\frac{\sum_{i=1}^{n}\left(T_i f_i^{-1} - \overline{Tf^{-1}}\right)^2}{n-1}}, \qquad \overline{Tf^{-1}} = \frac{\sum_{i=1}^{n} T_i f_i^{-1}}{n},$$

and $T_i$ is the total time spent on the $i$th page and $f_i$ is the access frequency of the $i$th page. Any page with a $TF_i$ value greater than a chosen critical value is an outlier; at a confidence level of 95%, this value is approximately 2.

Mining web usage statistical outliers may be useful in helping organizations identify the most widely used pages on their web sites. The contents of these outlying pages should be brought to the main page to attract more users. Further, the keywords in the outlying pages may be added to the metadata description to enrich the pages' "return-potential" by search engines.
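To make the test concrete, the following minimal sketch (our own, not code from the paper) computes $TF_i$ for the access-frequency and time pairs of Figure 2. The data and the threshold of 2 at the 95% confidence level come from the text; everything else is an illustrative assumption.

```python
from math import sqrt

# (page, access frequency f_i, total time T_i) -- the pairs from Figure 2
pages = [("P1", 20, 45), ("P2", 15, 35), ("P3", 25, 60),
         ("P4", 30, 60), ("P5", 40, 20)]

n = len(pages)
ratios = [t / f for _, f, t in pages]                  # T_i * f_i^(-1)
mean = sum(ratios) / n                                 # mean time-to-frequency ratio
sigma = sqrt(sum((r - mean) ** 2 for r in ratios) / (n - 1))  # sample std dev

for (name, f, t), r in zip(pages, ratios):
    tf = (r - mean) / sigma                            # test statistic TF_i
    # pages with TF_i above ~2 are statistical outliers at 95% confidence
    print(f"{name}: f={f}, T={t}, ratio={r:.2f}, TF={tf:+.2f}")
```

On this tiny sample, P3 has the largest standardized ratio and P5 the smallest, matching the discussion of Figure 2; a real usage log would supply many more pages, making the normality assumption and the threshold of 2 more meaningful.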



The second problem is addressed using web usage pattern outlier mining. Web usage pattern outliers are present in web usage trends. Consider the trends:
• 60% of users who visited the site www.chapters.ca purchased book(s).
• 20% of site visitors accessed Chapters' main page followed by the book page.
Web usage pattern outlier mining concentrates on finding patterns that deviate from normal ones. In the sample trends above, web usage pattern outliers would be the sites visited by the fraction of the remaining 40% of visitors who purchased book(s) outside Chapters. Designing models for mining web usage pattern outliers is beyond the scope of this paper.

3.2 Web Structure Outliers
Web structure mining is the discovery of interesting patterns in the hyperlink structure of the web. The motive of web structure mining is to use the hyperlinks on the web to categorize documents into domains so that pages within the same category can be compared in terms of their similarities and differences. The web is a directed graph with nodes corresponding to web pages and edges corresponding to hyperlinks. A graph is defined as a pair (V, E), where V is a set of vertices corresponding to web pages and E is a set of edges corresponding to the links. Ideally, linked nodes contain related information, but there are linked nodes that contain completely unrelated information. Web structure outliers are links whose connecting nodes contain unrelated information.

Definition: A web structure outlier is a set of links with completely unrelated connecting nodes. Given a graph (V, E), where V is a set of pages and E is a set of links, if E_ij is the link between V_i and V_j but the contents of V_i and V_j are completely unrelated, then the link E_ij constitutes a web structure outlier.

Web structure outlier mining involves finding links with unrelated connecting nodes on the web. Mining web structure outliers is more challenging because it requires efficient graph representations of the web pages, followed by matching the links with their adjacent nodes to determine whether or not their contents are related. Mining web structure outliers may lead to the identification of criminal activities. It can also be used to improve the quality of results returned by web search engines.
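As an illustration of the definition, the following minimal sketch (our own, not an algorithm from the paper) flags links E_ij whose endpoints V_i and V_j share almost no vocabulary. The Jaccard similarity measure, the threshold, and the sample graph are illustrative assumptions.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two term sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical graph (V, E): node id -> content terms, plus directed links.
V = {
    "v1": {"insurance", "policy", "premium", "claim"},
    "v2": {"claim", "policy", "coverage", "premium"},
    "v3": {"casino", "poker", "jackpot", "slots"},
}
E = [("v1", "v2"), ("v1", "v3")]

THRESHOLD = 0.1  # below this, the connecting nodes are treated as unrelated

for vi, vj in E:
    if jaccard(V[vi], V[vj]) < THRESHOLD:
        print(f"web structure outlier: link {vi} -> {vj}")
```

Here the link v1 -> v3 is flagged because the two pages share no terms, while v1 -> v2 passes as related.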


3.3 Web Content Outliers
The web consists of interrelated web pages grouped into different categories depending on their contents. A web content outlier is a page (or pages) with completely different contents from similar pages within the same category. For instance, Bombardier is an airplane manufacturer that also manufactures snowmobiles. The company web site has pages dedicated to promoting and marketing snowmobiles. Since airplane manufacturers do not generally make snowmobiles, the pages on snowmobiles constitute web content outliers.

Web contents comprise data of varying types, including text, video, pictures, etc. Thus, certain pages may be outliers with respect to a particular data type or a combination of data types. To reduce the complexity of the problem, this paper considers only data of type text; any other data are removed during preprocessing. A formal definition of a web content outlier is given next.

Definition: Given a set of web documents d_i (i = 1, 2, …, n), each with relative weight w_i, from category C, the document d_j constitutes a web content outlier if w_j > w_min, where w_min is a threshold assigned by the miner based on previous experience with similar data.

Notwithstanding the diverse and noisy nature of web contents, we provide a general framework for mining web content outliers.

4. WEB CONTENT OUTLIER MINING FRAMEWORK
The proposed framework for mining web content outliers has four major components: resource extraction, preprocessing, web content outlier detection, and outlier analysis. Figure 3 depicts the web content outlier mining framework. The framework assumes the existence of a dictionary containing the important words belonging to the category of interest. The components of the framework are described in detail next.

Resource Extraction (download pages; filter pages) → Preprocessing (remove unwanted tags; remove stop-words; group text) → Outlier Detection → Outlier Analysis

Figure 3: WCO-mining framework

4.1 Resource Extraction
Resource extraction is the process of retrieving the desired web pages belonging to the category of interest. This can be achieved using any of the existing web search engines or web crawlers [4]. The crawlers should be very efficient in extracting all pages belonging to a given category. The following tasks are expected to be completed at the end of the extraction (a sketch of this phase follows the list):
• Download all pages belonging to the category of interest (e.g., health, education).
• Filter the downloaded pages to eliminate those that do not belong to the category but have been downloaded.
• Analyze the pages to eliminate text that is not enclosed in the desired tags <META>…</META>, <TITLE>…</TITLE> and <BODY>…</BODY>.
Achieving the desired results at the resource extraction phase may require extensions to existing web crawler algorithms. The resulting output is fed as input to the preprocessing phase.
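The sketch below illustrates the download-and-filter steps under strong simplifying assumptions: a fixed seed list stands in for a crawler, and a page is kept when it contains any term from a hand-picked category vocabulary. The URLs, terms, and function name are hypothetical.

```python
from urllib.request import urlopen

SEED_URLS = ["http://example.org/health/page1.html"]   # hypothetical seed pages
CATEGORY_TERMS = {"health", "medicine", "clinic"}       # hypothetical category vocabulary

def extract_resources(urls):
    """Download pages and keep those that appear to belong to the category."""
    pages = []
    for url in urls:
        with urlopen(url) as resp:                      # download the page
            html = resp.read().decode("utf-8", errors="replace")
        if set(html.lower().split()) & CATEGORY_TERMS:  # crude category filter
            pages.append((url, html))
    return pages
```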

4.2 Preprocessing
The preprocessing phase transforms the extracted data into a structured form usable by the outlier detection algorithm. The first step removes unwanted data embedded in the HTML, that is, data of any type other than text (e.g., hyperlinks, sound, pictures). In addition, stop-words (words with frequency greater than some user-specified cutoff) are carefully removed; special care is taken so that important words that occur frequently are not removed. Finally, the words are grouped and the HTML tags removed. The preprocessed data is used as input to the outlier detection phase.
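A minimal sketch of this phase follows: it keeps only text inside the TITLE and BODY tags plus META description content, then drops words whose frequency exceeds a user-specified cutoff, following the paper's notion of stop-words. The class name and cutoff value are our own illustrative choices.

```python
from collections import Counter
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect words from TITLE/BODY text and META description attributes."""

    def __init__(self):
        super().__init__()
        self.depth = {"title": 0, "body": 0}  # nesting depth inside kept tags
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag in self.depth:
            self.depth[tag] += 1
        elif tag == "meta":                   # META tags carry text in attributes
            d = dict(attrs)
            if d.get("name") == "description":
                self.words += d.get("content", "").lower().split()

    def handle_endtag(self, tag):
        if tag in self.depth and self.depth[tag] > 0:
            self.depth[tag] -= 1

    def handle_data(self, data):
        if any(self.depth.values()):          # only keep text inside TITLE/BODY
            self.words += data.lower().split()

def preprocess(html, max_freq=50):
    """Return the grouped words with overly frequent ones (stop-words) removed."""
    parser = TextExtractor()
    parser.feed(html)
    counts = Counter(parser.words)
    return [w for w in parser.words if counts[w] <= max_freq]
```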

4.3 Outlier Detection
The goal of the outlier detection phase is to discover rare patterns existing in the web contents. The main inputs are the preprocessed data and a dictionary containing the important words of the category of interest. The algorithm assigns weights to words on each page based on the HTML tags that enclose them, and it awards penalties to words that appear on a page but are not found in the dictionary. The weights of the words on a page are accumulated and compared with a user-defined weight for every page in the domain. Outlying documents are those with weights greater than the user-defined weight. The rationale for assigning weights and awarding penalties is discussed next.

4.3.1 Assigning Weights


The HTML structure of the web imposes a hierarchy of importance on the text appearing in a document, depending on which tags enclose the text. In other words, the importance of a text is partly determined by the HTML tags that enclose it. This feature of HTML is exploited in assigning weights to the text sources of web pages. Weights are assigned considering only words enclosed in the HTML tags <META>…</META>, <TITLE>…</TITLE> and <BODY>…</BODY>, since these are the most important tags in an HTML document. META and TITLE tags give a better representation of web contents, but most web pages do not have META tag descriptions. Larger weights are assigned to words that appear within the META and TITLE tags than to those that appear within the BODY tag. The structure-oriented weighting technique is applied: each term is represented by the weight of the HTML tag that encloses it and its occurrence frequency [7]. In addition, penalties are awarded to words that are present in the pages but not found in the dictionary. The weight for each page is divided by the number of words on the page; the resulting weight is called the Relative Document Weight (RDW), since it accounts for the number of terms within each document. The advantage of RDW is that documents of varying sizes within the same category can be compared. RDW is given by the function:


$$RDW_i = \frac{\sum_{e_k} \sum_{j} \big( p(t_j) \times w(e_k) \times TF(t_j, e_k, d_i) \big)}{n_i}$$

where $e_k$ is the HTML element, $w(e_k)$ is the weight assigned to a term $t_j$ occurring in the element $e_k$, $TF(t_j, e_k, d_i)$ denotes the number of times a term $t_j$ is present in the element $e_k$ of the HTML document $d_i$, $p(t_j)$ is the penalty awarded against a term $t_j$ not found in the dictionary, and $n_i$ is the number of terms present in the document. The functions $w(e_k)$ and $p(t_j)$ are defined as:

$$w(e_k) = \begin{cases} \beta, & \text{if } e_k = \text{META/TITLE}, \; \beta > 1\\ 1, & \text{otherwise} \end{cases}$$

$$p(t_j) = \begin{cases} 1, & \text{if } t_j \text{ exists in the dictionary}\\ \lambda, & \text{otherwise}, \; 0 < \lambda \end{cases}$$
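The following minimal sketch computes the RDW of a page under illustrative assumptions: β = 2 for META/TITLE terms, a small category dictionary, and a penalty λ = 2 for terms outside it (chosen so that off-category pages score higher, consistent with flagging documents with w_j > w_min as outliers). The page data and w_min are hypothetical.

```python
BETA = 2.0          # weight for META/TITLE terms (beta > 1)
LAMBDA = 2.0        # illustrative penalty for terms missing from the dictionary
DICTIONARY = {"insurance", "policy", "premium", "claim"}  # category words

def w(element: str) -> float:
    """Tag weight w(e_k)."""
    return BETA if element in ("meta", "title") else 1.0

def p(term: str) -> float:
    """Penalty p(t_j): 1 for dictionary terms, LAMBDA otherwise."""
    return 1.0 if term in DICTIONARY else LAMBDA

def rdw(tokens):
    """tokens: (term, enclosing element) pairs, one per occurrence, so the
    term frequency TF(t_j, e_k, d_i) is accounted for implicitly."""
    return sum(p(t) * w(e) for t, e in tokens) / len(tokens) if tokens else 0.0

# Loan-related pages on an insurance site score high, echoing the paper's
# mortgage-loan example: their RDW exceeds the miner's threshold w_min.
page = [("mortgage", "title"), ("loan", "body"), ("insurance", "body")]
W_MIN = 1.5
print(rdw(page), rdw(page) > W_MIN)   # 2.33, True -> web content outlier
```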
