EUSFLAT - LFA 2005

Computing Temporal Trends in Web Documents

Mark Last Department of Information Systems Engineering Ben-Gurion University of the Negev Beer-Sheva, Israel Email: [email protected]

Abstract

Most existing methods of web content mining assume a static nature of web documents. This approach is inadequate for long-term monitoring and analysis of web content, since both the users' interests and the content of most web sites are subject to continuous change over time. In this research, we are interested in developing computationally intelligent and efficient text mining techniques that enable continuous comparison between documents provided by the same source (website, institute, organization, author, etc.) or viewed by the same group of users (e.g., university students), together with timely detection of temporal trends in those documents. Our approach builds upon a recently developed methodology for fuzzy comparison of frequency distributions. The proposed techniques are evaluated on a real-world stream of web traffic.

Keywords: Web Content Mining, Text Mining, Trend Detection, Trend Discovery, Automated Perceptions.

1 Introduction

The rapidly growing area of web mining includes the fields of web usage mining, web structure mining, and web content mining. Detection and monitoring of changes in the dynamic content of the web is one of the important issues in web content mining applications [11], since the interests of web users, as well as the content of most web sites, are subject to continuous changes over time. The characteristics of a trend in certain web content (e.g., an increased occurrence of certain key phrases in the web traffic) may indicate important changes in the online behavior of web sites, individual users, and virtual communities.

Several methods for change and trend detection in dynamic web content are described in [3]. These methods deal with time-stamped documents or versions of the same document downloaded from the web, where a trend can be recognized by a change in the frequency of certain topics over a user-specified period of time. The topics are usually associated with noun phrases occurring in the documents. A k-phrase is defined as an iterated list of phrases with k levels of nesting. According to [3], a typical algorithm used to discover trends in a collection of time-stamped documents has the following major phases: identification of frequent phrases, generation of phrase histories, and finding patterns that match a specific trend.

Defining a trend (such as "upwards", "downwards", etc.) is a truly subjective and user-dependent task. Lent et al. [10] propose a Shape Definition Language (SDL), which allows users to specify approximate ("blurry") queries with respect to their trends of interest in a text database. A "shape" is defined as a sequence of time intervals and their associated slopes. Actual trends are discovered by scanning a set of phrases and identifying those that match the given shape query. Frequent phrases are identified using a sequential pattern mining algorithm, where each phrase frequency is measured by the number of documents that contain the phrase. Trends are the k-phrases selected by the shape query, together with the additional information of the periods where the trend is supported. The trend discovery method of [10] has been successfully applied to the US Patent Database.

A trend discovery system for mining dynamic content of news web sites is presented in [12]. The monitored web pages are downloaded by a dynamic crawler (web information agent) that is activated periodically, e.g., on a daily basis. The preprocessing module of this system implements POS (part-of-speech) tagging to identify the most frequent noun strings, which are used for constructing a list of topics. The frequency of a topic is calculated as the number of news reports mentioning the topic in a given period. Trend analysis is performed by comparing the probability distributions of the news topics in two consecutive periods (e.g., weeks). Individual topic changes are calculated as the absolute differences between the probabilities of topic occurrence. The proposed method identifies the main change factors (news topics) along with the "stable" topics that maintain the same level of importance in both periods. In the case study of [12], the main detected change factors were straightforward indicators of major national and international events that occurred during the period of the study.

The problem of detecting a change in two consecutive versions of the same web page is discussed by Jatowt and Ishizuka [6]. This problem is closely related to the area of Topic Detection and Tracking (TDT), which is focused on recognition and classification of events from online news streams. The comparison of two versions is done at the sentence level. Two types of textual changes are considered: an insertion and a deletion. Single words and bi-grams are used as selected features (terms). The maximum score is assigned to terms that are inserted into or deleted from a high number of documents within a relatively short period of time. Topic identification and tracking is especially important for monitoring the dynamics of personal web publications such as weblog postings.
According to [5], the lifecycle of a topic can include the following sections: RampUp, RampDown, MidHigh, and Spike. Each section is defined by a crisp statistical predicate. Thus, RampUp is defined as "all days in first 20% of post mass below mean and average day during this period below µ−σ/2". The main topics are selected in [5] by manually examining the list of the most frequent individual terms and then expanding the selected terms with co-occurring nouns. The problems of interest include identification of spikes in the number of postings on a specific topic and characterization of individuals by their behavior in the blogspace.

Identifying a continuous change in topic frequency across a sequence of consecutive periods is a highly subjective task, which is much easier for a human eye examining a graphical plot of the frequency data than for numeric statistical techniques. Human conclusions tend to bear some amount of vagueness and are much more easily described in words (such as "recent upwards trend" or "spike") than in strict mathematical terms (such as µ + 2σ). In their decisions, people also try to rely on existing expert knowledge, which is usually linguistic and imprecise in nature. Thus, automated trend detection and monitoring can be seen as a particular case of automated perceptions (see [7]).

In this paper, we use the recently developed methodology for fuzzy comparison of frequency distributions [9] to automate the perception of temporal trends in dynamic web content. The rest of the paper is organized as follows. Section 2 presents our trend detection methodology. A case study based on real-world data of web traffic content is described in Section 3. Section 4 contains conclusions and directions for future research.

2 Automated Perception of Temporal Trends in Web Content

In this work, we focus on the problem of detecting trends in viewed content, though the approach can easily be extended to monitoring posted content as well. We assume input data in the form of time-stamped documents D_t,j downloaded by the same group of web users, where t is a time stamp and j is the index of a document from period t. The time stamp can be either the posting time (if available and reliable) or the time when the document was downloaded from the web. A trend is measured over a sequence of two or more consecutive periods (days, weeks, etc.). Unlike traditional time series analysis, where the trend is defined as a single component of the series, we calculate the trend in a period t as a vector of topics k and their respective frequency trends Trend_k^t(T) measured over the T most recent periods.
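This input representation can be sketched in a few lines of code. The following is a minimal illustration under the assumption that each downloaded document is reduced to its set of extracted keyphrases and grouped by period; the function and variable names are ours, not the paper's:

```python
def topic_frequencies(docs_by_period, topics):
    """For each topic k and period t, compute the fraction of period t's
    documents that contain topic k's keyphrase."""
    freq = {}  # (topic, period) -> relative frequency
    for t, docs in docs_by_period.items():
        n_t = len(docs)  # total number of documents downloaded in period t
        for k in topics:
            f_kt = sum(1 for d in docs if k in d)  # documents containing k
            freq[(k, t)] = f_kt / n_t if n_t else 0.0
    return freq

# toy usage: two periods, documents as sets of keyphrases
docs = {1: [{"java", "sdk"}, {"news"}], 2: [{"java"}, {"java", "sun"}]}
p = topic_frequencies(docs, ["java"])
# p[("java", 1)] == 0.5, p[("java", 2)] == 1.0
```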


The list of monitored topics is associated with the most frequent keywords or keyphrases across the monitored documents. The document keyphrases can be identified by any phrase-identification method, such as mining sequential patterns [10][13] or the GenEx algorithm [14]. The relative frequency of a topic k in a period t is calculated as p_k^t = f_k^t / N_t, where f_k^t is the number of downloaded documents containing the corresponding keyphrase and N_t is the total number of documents downloaded in period t. Similar measures of topic frequency are used in [10] and [12].

Following our previous work on fuzzy comparison of frequency distributions [9], we assume that the difference between proportions (relative frequencies) is a linguistic variable, which can take two values: bigger and smaller, each being a fuzzy set. The following membership functions can be used for a topic k in a period t:

µ_S(k, t) = 1 / (1 + e^(β(d_k^t + α_S))),   d ∈ [−1, 1]; α_S, β ≥ 0     (1)

µ_B(k, t) = 1 / (1 + e^(−β(d_k^t − α_B))),  d ∈ [−1, 1]; α_B, β ≥ 0     (2)

where d_k^t = p_k^t − p_k^(t−1) is the difference between the measured proportions (relative frequencies) of the same topic in the current and the previous periods, and α_S, α_B are the scale factors, which determine the scale of the membership functions (their intersection with the Y-axis). As we show in [9], the scale factors can be used to represent the prior belief about a difference between proportions of a specific topic. β is the shape factor, which can change the shape of the membership function from a horizontal line (β = 0) to a step function (β → ∞). In [9], we associate β with the sample size used for calculating the relative frequency.

In the general case, the values of α_S and α_B do not have to be equal to each other. However, if they are equal, the following lemma can be derived from the above definitions of the membership functions [8]:

Lemma 1: When the compared proportions are equal (d_k^t = 0) and α_S = α_B = α, the membership functions of "bigger" and "smaller" are equal to each other and can be calculated by:

µ_B(k, t) = µ_S(k, t) = 1 / (1 + e^(αβ))     (3)
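Eqs. (1)-(3) translate directly into code. The following sketch uses the parameter values α = 0.25 and β = 8 from Figure 1; the function names are ours:

```python
import math

def mu_smaller(d, alpha_s=0.25, beta=8.0):
    """Membership of proportion difference d in the fuzzy set "smaller" (Eq. 1)."""
    return 1.0 / (1.0 + math.exp(beta * (d + alpha_s)))

def mu_bigger(d, alpha_b=0.25, beta=8.0):
    """Membership of proportion difference d in the fuzzy set "bigger" (Eq. 2)."""
    return 1.0 / (1.0 + math.exp(-beta * (d - alpha_b)))

# Lemma 1: for d = 0 and equal scale factors, both grades reduce to 1 / (1 + e^(alpha*beta))
assert abs(mu_bigger(0.0) - mu_smaller(0.0)) < 1e-12
assert abs(mu_bigger(0.0) - 1.0 / (1.0 + math.exp(0.25 * 8.0))) < 1e-12
```

An emerging topic (d well above α_B) yields mu_bigger close to 1 and mu_smaller close to 0, and symmetrically for a disappearing topic.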

The functions µ_S(d) and µ_B(d), representing the author's subjective perception of the difference between proportions, are shown graphically in Figure 1. The non-linearity of the membership functions is essential for representing the human way of reasoning with visualized data. Thus, for the linguistic term "bigger", the decrease in the proportion difference from the maximum value of 1.0 to 0.5 is perceived as much more important than the decrease from 0.5 to 0.0, which, in turn, appears to be more meaningful than the decrease from 0.0 to −0.5. As indicated by [2], the exponential character of human perception has been confirmed by experiments in perceptual psychology.

Figure 1: Membership functions for "Smaller" and "Bigger" (α = 0.25, β = 8)

If there is an increase in topic frequency between the two periods (a topic is emerging), the membership grade of the difference between proportions in the bigger fuzzy set will be close to 1, and its membership in the smaller fuzzy set will be close to 0. The opposite situation occurs when a topic is disappearing, i.e., when there is a decrease in its relative frequency. Since a positive (negative) trend over multiple periods assumes a steady increase (decrease) in relative frequency, we can calculate an average trend in the frequency of a topic k over the T most recent periods (T ≥ 2) using the following expression:

Trend_k^t(T) = ( Σ_{i = t−T+2}^{t} [µ_B(k, i) − µ_S(k, i)] ) / (T − 1)     (4)

where µ_B(k, i) and µ_S(k, i) are the membership grades of topic k in period i in the fuzzy sets "bigger" and "smaller", respectively. Effectively, the trend calculated above represents our perception of the T−1 differences between topic frequencies in T consecutive periods. For T = 2, the trend stands simply for a change between two consecutive periods; as T becomes larger, it describes longer-term effects of a continuous decrease or increase in the popularity of a topic.
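Eq. (4) amounts to averaging the T−1 perceived differences. A direct sketch, with the membership grades supplied as precomputed dictionaries (the naming is ours):

```python
def average_trend(mu_b, mu_s, t, T):
    """Average fuzzy trend over the T most recent periods ending at t (Eq. 4).
    mu_b[i] and mu_s[i] are the membership grades of the proportion difference
    between periods i-1 and i in the fuzzy sets "bigger" and "smaller"."""
    if T < 2:
        raise ValueError("a trend needs at least two periods")
    return sum(mu_b[i] - mu_s[i] for i in range(t - T + 2, t + 1)) / (T - 1)

# toy usage: a topic whose frequency rises steadily over three periods
mu_b = {2: 0.9, 3: 0.8}  # grades of "bigger" for the two differences
mu_s = {2: 0.1, 3: 0.2}  # grades of "smaller" for the same differences
result = average_trend(mu_b, mu_s, t=3, T=3)  # (0.8 + 0.6) / 2, approximately 0.7
```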


Since a trend can truly be considered a subjective, linguistic variable (see Section 1 above), the numeric values of Trend_k^t(T) calculated by Eq. (4) can be further characterized using a set of linguistic terms. In this paper, we use the following set of five terms, which is partially based on the illustrative alphabet presented in [10]: highly decreasing (HD), slightly decreasing (SD), stable (S), slightly increasing (SI), and highly increasing (HI). The suggested fuzzy membership functions for representing each term are shown in Figure 2. In the next section, the proposed trend detection methodology is applied to web documents downloaded over time by a pre-defined group of web users.

Figure 2: Trend as a linguistic variable (membership functions of the five linguistic terms over the trend range [−1, 1])

3 Case Study: Monitoring Web Traffic Content

The fuzzy-based trend detection methodology presented in Section 2 above has been applied to a corpus of 16,331 HTML documents downloaded during a period of 10 weeks by the users of 38 public computers at Ben-Gurion University. All monitored computers are normally used by the same homogeneous group of users: undergraduate students in information systems engineering. Non-English (mainly Hebrew) documents viewed by the same students have been excluded from the analysis. The keyphrases have been extracted from each document using the Extractor 7.2 text summarization engine [4], which is based on the GenEx algorithm [14]. The number of keyphrases per document varied between one and thirty, resulting in a total of 46,494 distinct keyphrases. Using inter-document frequency, the 1,000 most frequent keyphrases in the vocabulary have been chosen for the trend analysis. Figure 3 shows the distribution of the number of documents containing each keyphrase on a log-log plot. The distribution closely resembles Zipf's law and a power law [1].

Figure 3: Distribution of keyphrase frequency (keyphrase rank vs. number of containing documents, log-log scale)

To find the most significant positive and negative trends, we have computed the average fuzzy trends of every keyphrase in each weekly period while varying the number of preceding periods T between 2 and 5. We have not examined trends spanning more than 5 weeks, since the entire period of data collection was limited to 10 weeks. Membership functions for "Smaller" and "Bigger" have been defined as in Figure 1 (α = 0.25, β = 8). For each value of T, the keyphrases have been sorted in decreasing order of their maximum absolute fuzzy trends and converted into the linguistic terms defined by the membership functions of Figure 2 above. Each fuzzy trend has been associated with the term having the maximum membership grade for its value.
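The conversion from a numeric fuzzy trend to a linguistic term can be sketched as below. Since the paper defines the Figure 2 membership functions only graphically, the triangular shapes and breakpoints here are our illustrative assumption:

```python
def linguistic_term(trend_value):
    """Map a numeric fuzzy trend in [-1, 1] to one of the five linguistic
    terms of Figure 2, choosing the term with the maximum membership grade.
    The triangular membership functions are an assumed approximation."""
    centers = {"HD": -1.0, "SD": -0.5, "S": 0.0, "SI": 0.5, "HI": 1.0}

    def triangular(x, c, half_width=0.5):
        # grade 1 at the center c, falling linearly to 0 at c +/- half_width
        return max(0.0, 1.0 - abs(x - c) / half_width)

    grades = {term: triangular(trend_value, c) for term, c in centers.items()}
    return max(grades, key=grades.get)

# examples: 0.7 lies closer to the SI center (0.5) than to the HI center (1.0)
assert linguistic_term(0.7) == "SI"
assert linguistic_term(-0.95) == "HD"
```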

The top-ranking keyphrases having maximum positive trends over two consecutive periods (T = 2) are shown in Table 1. All maximum trends of these keyphrases have been detected between periods 3 and 4. A brief look at the list of keyphrases reveals that they all seem to be related to some information about Java, which had an increased demand by the students during week 4 of the study. Interestingly enough, there was a sharp decrease in the frequency of the same keyphrases one week later (see the column of Week 5 in Table 1).


Table 1: Maximum Positive Trends (T = 2)

KEYPHRASE              TREND (Week 4)   TREND (Week 5)
java                   HI               HD
sun                    HI               HD
licensing              HI               HD
articles               HI               HD
sun microsystems       HI               HD
faqs                   HI               SD
platform               SI               SD
feeds                  SI               SD
technical manuals      SI               SD
sdk                    SI               SD

Table 2 reveals a slightly more complex picture for the largest negative trends over two consecutive periods. Apparently, there was a slight decrease in the use of the Hotmail and MSN sites, especially with respect to news and email, between weeks 1 and 2. Later on, between weeks 6 and 7, there was a similar decrease in the usage of web mail services, this time unrelated to Hotmail and MSN.

Table 2: Maximum Negative Trends (T = 2)

KEYPHRASE              TREND (Week 2)   TREND (Week 7)
hotmail                SD               SI
compose                SD               SD
msn                    SD               SI
bcc                    SD               SD
smile                  SD               SI
msn israel             SD               SI
signature              SD               SD
news                   SD               S
message html editor    SD               SD
outgoing message html  SD               SD
inbox                  S                SD

If we examine longer-term trends over a span of five consecutive weeks, we find some additional phenomena. Thus, Table 3 shows that in week 8 we could detect a slight decrease in demand for Java-related information, which started after week 4 (see Table 1 above). This trend can be visually observed on the frequency plot of the "java" keyphrase in Figure 4. However, a simple moving average (SMA) calculated over the same number of consecutive periods (five) would not detect this trend before week 9. Such a delay in response time is certainly undesirable in most topic tracking applications.

Table 3: Maximum Negative Trends (T = 5)

KEYPHRASE              TREND (Week 8)
platform               SD
faqs                   SD
sdk                    SD
character              SD
sun microsystems       SD
articles               SD
licensing              SD
sun                    SD
java                   SD

Figure 4: Weekly frequency of the keyphrase "java" (with its five-week simple moving average, SMA5)

4 Conclusions

This paper has introduced a new, fuzzy-based method for identifying short-term and long-term trends in dynamic web content. The proposed methodology is based on a commercial keyphrase extraction tool and a computationally intelligent method for fuzzy comparison of keyphrase inter-document frequencies in consecutive periods. The approach has been demonstrated on a corpus of time-stamped documents downloaded by web users.

Directions for future research include application of the fuzzy-based methodology to tracking consecutive versions of specific websites, developing advanced methods for keyphrase and topic identification, and exploring correlations between the trends of multiple keyphrases. We also intend to compare our methodology to statistical techniques of time series filtering.

Acknowledgments

This work was partially supported by the National Institute for Systems Test and Productivity at the University of South Florida under the USA Space and Naval Warfare Systems Command, Grant No. N00039-01-1-2248. Omer Zaafrany and Yaniv Makover from Ben-Gurion University collected and prepared the data for this study.

References

[1] L. A. Adamic, "Zipf, Power-laws, and Pareto - a ranking tutorial," http://www.hpl.hp.com/research/idl/papers/ranking/ranking.html
[2] J. M. Chambers, W. S. Cleveland, B. Kleiner, and P. A. Tukey, Graphical Methods for Data Analysis, Chapman & Hall, 1983.
[3] G. Chang, M. J. Healey, J. A. M. McHugh, and J. T. L. Wang, Mining the World Wide Web - An Information Search Approach, Kluwer Academic Publishers, 2001.
[4] Extractor Version 7.2, http://www.extractor.com/
[5] D. Gruhl, R. Guha, D. Liben-Nowell, and A. Tomkins, "Information Diffusion Through Blogspace," SIGKDD Explorations, Vol. 6, Issue 2, December 2004, pp. 43-52.
[6] A. Jatowt and M. Ishizuka, "Summarization of Dynamic Content in Web Collections," in Knowledge Discovery in Databases: PKDD 2004, 8th European Conference on Principles and Practice of Knowledge Discovery in Databases, J.-F. Boulicaut, F. Esposito, F. Giannotti, and D. Pedreschi, Eds., Lecture Notes in Computer Science, Springer-Verlag, pp. 245-254, 2004.
[7] M. Last and A. Kandel, "Automated Perceptions in Data Mining," 1999 IEEE International Fuzzy Systems Conference Proceedings, Seoul, Korea, August 1999, Part I, pp. 190-197.
[8] M. Last and A. Kandel, "Perception-based Analysis of Engineering Experiments in the Semiconductor Industry," International Journal of Image and Graphics, Vol. 2, No. 1, 2002, pp. 107-126.
[9] M. Last and A. Kandel, "Fuzzy Comparison of Frequency Distributions," in Soft Methods in Probability, Statistics, and Data Analysis, P. Grzegorzewski et al., Eds., Physica-Verlag, Advances in Soft Computing, pp. 219-227, 2002.
[10] B. Lent, R. Agrawal, and R. Srikant, "Discovering Trends in Text Databases," in Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, California, 1997, pp. 227-230.
[11] B. Liu and K. C.-C. Chang, "Editorial: Special Issue on Web Content Mining," SIGKDD Explorations, Vol. 6, Issue 2, December 2004, pp. 1-4.
[12] A. Mendez-Torreblanca, M. Montes-y-Gomez, and A. Lopez-Lopez, "A Trend Discovery System for Dynamic Web Content Mining," Proceedings of CIC-2002, 2002.
[13] R. Srikant and R. Agrawal, "Mining Sequential Patterns: Generalizations and Performance Improvements," in Advances in Database Technology - EDBT '96, 5th International Conference on Extending Database Technology, 1996.
[14] P. D. Turney, "Learning Algorithms for Keyphrase Extraction," Information Retrieval, Vol. 2, No. 4, 2000, pp. 303-336.
