An Online Blog Reading System by Topic Clustering and Personalized Ranking

XIN LI, Peking University
JUN YAN, Microsoft Research Asia
WEIGUO FAN, Virginia Tech
NING LIU, Microsoft Research Asia
SHUICHENG YAN, National University of Singapore
ZHENG CHEN, Microsoft Research Asia

There is an increasing number of people reading, writing, and commenting on blogs. According to a recent survey by Technorati, there are about 75,000 new blogs and 1.2 million new posts every day. However, it is difficult and time consuming for a blog reader to find the most interesting posts in the huge and dynamic blog world. In this article, an online Personalized Blog Reader (PBR) system is proposed, which helps blog readers browse the newest and most popular blog posts of interest to them by automatically clustering the most relevant stories. PBR aims to always rank a user's potential favorite topics higher than nonfavorite ones. This is accomplished in the following steps. First, the system collects and provides a unified incremental index of posts coming from different blogs. Then, an incremental clustering algorithm with a flexible half-bounded window of observation is proposed to satisfy the requirements of online processing. Finally, the system learns a user's personalized reading preferences to present the user with a final reading list. The experimental results show that the proposed incremental clustering algorithm is effective and efficient, and that the personalization of the PBR performs well.

Categories and Subject Descriptors: H.4.3 [Information Systems Applications]: Communications Applications—Information browsers; H.5.2 [Information Interfaces and Presentation]:

Authors' addresses: X. Li; email: [email protected]; J. Yan; email: [email protected]; W. Fan; email: [email protected]; N. Liu; email: [email protected]; S. Yan; email: [email protected]; Z. Chen; email: [email protected].
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected].
© 2009 ACM 1533-5399/2009/07-ART9 $10.00 DOI 10.1145/1552291.1552292 http://doi.acm.org/10.1145/1552291.1552292
ACM Transactions on Internet Technology, Vol. 9, No. 3, Article 9, Publication date: July 2009.

9:2



X. Li et al.

User Interfaces—Prototyping, user-centered design, interaction styles; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Clustering

General Terms: Design, Performance, Human Factors

Additional Key Words and Phrases: Blog, topic, story, personalization, ranking, connected subgraph, link information, content information

ACM Reference Format:
Li, X., Yan, J., Fan, W., Liu, N., Yan, S., and Chen, Z. 2009. An online blog reading system by topic clustering and personalized ranking. ACM Trans. Internet Technol. 9, 3, Article 9 (July 2009), 26 pages. DOI = 10.1145/1552291.1552292 http://doi.acm.org/10.1145/1552291.1552292

1. INTRODUCTION

A blog (weblog) is a personal journal on the Web, on which new posts are constantly published. A post (blog article) is also called a story. Stories express as many different subjects and opinions as there are people writing them. Some blogs are highly influential and have enormous readerships, while others are primarily intended for a close circle of family and friends. The power of blogs is that they allow millions of people to easily publish their ideas, and millions more to comment on them. The Pew Internet1 study estimates that about 11%, or about 50 million, of Internet users are regular blog readers. According to Technorati's2 data, there are about 75,000 new blogs a day, and as bloggers (people who write blogs) update their blogs regularly, there are about 1.2 million new posts daily, or about 50,000 blog updates per hour. It is therefore highly costly for a user to read all new posts or to find the interesting ones in the enormous world of blogs. We refer to this problem of finding interesting blogs as the blog mining problem.

Motivated by the rapid growth of blog space, we develop in this article a system called Personalized Blog Reader (PBR) to satisfy the requirements of real tasks faced by a common user. There are three main functions implemented in our PBR system:

(1) The first is a crawler. The system collects posts from a given user's favorite blogs (RSS feeds). Thus it provides a unified index of all posts that come from different blog sources.

(2) The second function is a blog analyzer, which clusters all crawled stories into topics. Since there may be many posts from the same or different RSS feeds that talk about the same event during a period of time, the blog analyzer helps users quickly access large quantities of information without duplicate reading of the same event, and easily discover popular topics by the number of posts in each cluster.
By examining the main story of a topic, a user can decide whether to read more blog posts related to that topic.

(3) The third main function is a personalized ranking system. There are large numbers of new topics coming out all the time. Though repeated news is

1 http://www.pewinternet.org/PPF/r/77/press_release.asp
2 http://www.technorati.com/weblog/2006/02/81.html


Online Blog Reading System by Topic Clustering and Personalized Ranking



9:3

handled by the analyzer during the clustering procedure, it is still time consuming for people to go through all the topics to find the ones in which they are interested. So our system learns a user's personal reading preferences, such as which kinds of topics he/she cares most about and which words he/she is sensitive to. When new posts come, all of the user's potential favorites will be ranked higher.

In line with these three functions, we divide the entire PBR system into three parts, with each part supporting one function. To keep track of the blogs subscribed to by blog readers, our crawler is an automated program that seeks out RSS feeds available on the Internet. It monitors and detects updates of RSS feeds. Once a feed has new posts, the crawler fetches them and sends them to the blog analyzer.

For story clustering, we propose an incremental clustering algorithm with two steps: (1) clustering on the static dataset to capture similarity between stories not covered in the same window of observation; (2) clustering on the dynamic data stream to satisfy the requirements of online processing.

Hyperlinks, as a common component of Web pages, have played an important role in Web search and analysis [Page et al. 1998; Kleinberg 1999]. Hyperlinks are even more significant for blogs, since bloggers frequently link to and comment on other blogs, which creates the sense of timeliness and connectedness one would have in a real conversation. Using link information, we can quickly cluster stories into topics. Because linked stories are most likely to talk about the same or a related event, this technique greatly reduces the topic clustering complexity in contrast to traditional purely content-based clustering algorithms.

To rank topics according to the personal taste of a user, we track all actions a user performs when using our system, and use them to build the user's personal profile, which describes and represents his/her personal blog reading habits.
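The profile-building idea can be illustrated with a minimal sketch. The term-counting scheme and the `UserProfile` class here are hypothetical simplifications for illustration; the paper's actual profile model is introduced in Section 6.

```python
from collections import Counter

class UserProfile:
    """Toy reading profile: accumulate term weights from the stories a
    user clicks on, then score new stories against the profile."""

    def __init__(self):
        self.term_weights = Counter()

    def record_click(self, story_terms):
        # every click reinforces the terms of the story that was read
        self.term_weights.update(story_terms)

    def score(self, story_terms):
        # a story is ranked by how strongly its terms match the profile
        return sum(self.term_weights[t] for t in story_terms)

profile = UserProfile()
profile.record_click({"world", "cup", "final"})
profile.record_click({"world", "cup", "schedule"})

candidates = {"cup news": {"world", "cup"}, "voip phone": {"voip", "sip"}}
ranked = sorted(candidates, key=lambda t: profile.score(candidates[t]),
                reverse=True)
```

After two clicks on World Cup stories, a new World Cup story outranks an unrelated one, which is the ranking behavior the system aims for.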
The rest of our article is organized as follows. Section 2 gives a short survey of related work. Section 3 outlines the overall structure of our system. Section 4 presents the background knowledge for our system. Section 5 describes our story clustering approach. Section 6 introduces our personalized ranking method. Section 7 discusses the experimental results, and Section 8 gives our conclusions and future work.

2. RELATED WORK

2.1 Graph Clustering

Clustering datasets into disjoint groups is a problem arising in many domains. From a general point of view, the goal of clustering is to reduce the amount of data by categorizing or grouping similar data items together. Often, datasets can be represented as weighted graphs, where nodes correspond to the entities to be clustered and edges correspond to the similarities between those entities based on certain similarity measures.

The problem of graph clustering is well studied and the literature on the subject is very rich. The best known graph clustering algorithms attempt to optimize specific criteria such as k-median, minimum sum, minimum diameter, and so forth [Bern and Eppstein 1996]. These techniques have the constraint


that the number of clusters or the number of points in each cluster has to be specified in advance (as is common with clustering algorithms). The problem is hard, and several heuristic approaches [Karypis and Kumar 1998] and approximation algorithms [Bansal et al. 2004; Giotis and Guruswami 2006] have been proposed. These algorithms are very interesting in theory, but may be far from practical. Other algorithms are application-specific and take advantage of the underlying structure or other known characteristics of the data. Examples are clustering algorithms for images [Wu and Leahy 1993] and program modules [Mancoridis et al. 1998]. Although there is a large body of existing work on graph clustering, it does not address the applications we have in mind. Because we are developing an online blog reading system, the speed of clustering and the consistency of content within each topic cluster are the central considerations, rather than the intrinsic graph partitioning problem itself.

2.2 Social Network Analysis

Social network analysis (SNA) is the study of mathematical models for interactions among people, organizations, and groups. Grounded in graph and system theories, it has developed over the past decades and proven to be a powerful tool for studying networks. SNA is suited to the Web environment, because explicit hyperlinks make it easy to capture the directed interactions and relationships between people and to discover their roles. Gibson et al. [1998] developed a technique to identify hyperlinked communities in Web environments, which includes the identification of hubs (strong central points with high numbers of outbound links) and authorities (highly referenced pages). More recently, SNA methods have begun to be applied to weblogs. Kumar et al.
[2003] observe and model temporally concentrated bursts of connectivity within blog communities over time, concluding that blogspace has been expanding rapidly since the end of 2001, "not just in metrics of scale, but also in metrics of community structure and connectedness." Adar et al. [2004] identify blogs that initiate "information epidemics" and visualize the paths specific infections take through blogspace. Marlow [2004] uses social network analysis to identify "authoritative" blog authors and compares them with measures of opinion leadership and authority in the popular press, and Delwiche [2005] studies the most authoritative authors. Although the popular collective term blogosphere originally implied a dynamic, cross-linked social network, Herring et al. [2005] find that less connected or unconnected blogs are in the majority on the Web. So SNA methods cannot find all implied relationships without explicit link structures between blogs. Moreover, the goal of our online blog reading system is to enable the user to quickly find and scan interesting topics rather than to examine the communities in blogspace. So we do not strictly follow SNA in this article.

2.3 Personalization

Generally, personalization involves a process of gathering information during interaction with a user and then delivering a more user-tailored service to that


user, especially in the area of Web services [Bonett 2001]. There have been quite a few research studies of personalization in Web search. To name but a few, Jeh and Widom [2003] proposed a personalized PageRank by linear combination. Qiu and Cho [2006] proposed learning the user profile as a topic preference to merge personalization and topic-sensitive PageRank. Some research uses personalized output filtering of the results of Web searches, for example Liu et al. [2004] and Ferragina and Gulli [2005]. A typical personalization technique is to first aggregate a user's information to learn a user profile and then to organize items according to this profile. This is the basis of automated collaborative filtering [Balabanovic and Shoham 1997] and recommendation strategies where clustering is used to aggregate user data [Sarwar et al. 2002; Kelleher and Bridge 2004]. Information overload is a general problem on the Web as well as in blogspace. So in this article, we perform personalization using similar techniques but integrating new characteristics of blogspace.

2.4 Blog Processing

The blog is becoming the most popular form for people to represent themselves freely on the Web. The decentralized and independent nature of blogging makes it very difficult to organize and categorize. An example of a recent and successful online service using blogs' category labels is Technorati.3 Technorati fetches all blog posts that have user-defined categories associated with them and offers Web pages containing all posts tagged under the same category label. However, it requires the user to know precisely what he/she is looking for, which cannot always be the case. Researchers on the Semantic Web project have proposed frameworks in which blogs are marked up using machine-readable metadata, written in a language such as RDF, to facilitate cross-blog indexing [Cayzer 2004; Karger and Quan 2005].
In contrast, tags have achieved widespread acceptance by bloggers due to their simplicity [Quintarelli 2005]. Tags, often one or two words long, are used to describe a set of posts in a blog. But tags are defined locally and there is no globally agreed list of tags. Brooks and Montanez [2005] find that tags are poor at retrieving documents with similar content, and Hayes et al. [2006a] identify that only a fraction of tags are used repeatedly. At a local level, bloggers themselves often maintain a list of links to blogs they find interesting, so some research [Avesani et al. 2005; Herring et al. 2005] tries to use this list of links to recommend posts similar to the active blogger's current interests. This technique assumes a static relationship over time between a community of bloggers and the topics they share. But Hayes et al. [2006b] observe rapid user drift and topic drift phenomena, and find that the blog domain moves frequently from topic to topic. To seek stable clusters in the blogosphere, Bansal and Koudas [2007] present efficient algorithms to identify keyword clusters: sets of keywords frequently appearing together and forming a cluster. They thus deal with single words and calculate the relationship between each pair of words. They also present an online system4 to

3 http://www.technorati.com/
4 www.blogscope.net



show keyword correlations, burst synopsis sets, and additional analysis of the blogosphere [Bansal and Koudas 2007]. Tse-Ming Tsan et al. [2006] introduced a personalized blog recommendation system using the user value model, semantic analysis, and SNA. They combine these three different kinds of clues to calculate a score for each blogger and blog article pair. The score estimates the tendency of a blogger to be interested in an article. However, the authors also note that current approaches may not be comprehensive enough, since the way people use blogs continues to evolve, and matching accuracy is the next topic to address. In our opinion, this study provides an interesting system architecture and analysis model, but may be far from working well in practice. Findory.com once performed personalization based on user reading and clicking behavior: when a person found and read an interesting article on Findory, that article would be shared with any other Findory readers who would likely be interested.

Because of the exponential growth in the use of blogs over the past few years, much research and many publications have come from varied sources. In contrast to previous work, we develop an online blog reading system in this article. We first design a fast online clustering algorithm to capture interesting topic clusters and avoid repeated reading automatically and temporally, which is different from studies on tag categorization or keyword tracking and matching in blogspace. The clustering algorithm makes the best of blog characteristics to avoid the complicated calculations of traditional clustering algorithms. Then, we silently learn and build a user profile to model a user's personal reading habits. In contrast, some previous blog reading systems, such as TECHMEME,5 TailRank,6 and Megite,7 may lack the ability to learn personal interests automatically based on a user's reading behavior.
The main contributions of our system are summarized as follows:

(1) Propose an incremental clustering algorithm utilizing both link and content information, based on a novel flexible window model (see Section 5), to facilitate online story clustering.
(2) Recommend an optimal entry for each topic to prevent a user from duplicate reading of the same event.
(3) Build a user profile to automatically learn a user's reading habits and to give a personalized ranking list.
(4) Provide a uniform interface for a user to read and browse posts from all the blogs he/she is interested in.

3. SYSTEM OVERVIEW

In this section, we give an overview and some detailed functions of our proposed system. To avoid ambiguities in terminology, we first give some definitions. A story is an article, an item, or a post written by a blogger. It may be a news posting or a personal comment on any topic. A topic is an event. Some stories

5 http://tech.memeorandum.com/
6 http://tailrank.com/
7 http://megite.com/



Fig. 1. System architecture.

may talk about the same topic during a short time period. For example, a cluster of stories may semantically talk about the same topic, such as discussions about mad cow cases in the USA.

The architecture of our system is shown in Figure 1. There are three main modules: crawler, analyzer, and ranker. When the system starts to run, it follows these steps:

— The user first subscribes to all the blogs (RSS feeds) he/she is interested in, such as Official Google Blog, boingboing.net, engadget.com, and so on. Of course he/she can dynamically add or remove any blog available on the Web.
— The crawler steadily collects all new stories posted on the user's subscribed blogs to keep the local collection fresh. To keep tracking blogs, our crawler is an automated program that detects updates of RSS feeds in real time. All new stories are fetched and stored in a story database.
— Once the story database is updated, the analyzer runs the incremental story clustering algorithm to obtain new topics from the current crawled database, and selects a main story for each topic. The main story is the recommended one, which best describes the event compared with other stories on the same topic. Thus a user only needs to read the main story to get a glimpse of the topic.
— The ranker ranks topics by the user's personal preferences, which are learned from the user's history log file. For example, upon observing that a user frequently clicked on stories about news or comments on the World Cup, our system will rank all


Fig. 2. System user interface.

topics related to the World Cup higher than the other topics, and recommend to the user his/her potential favorite story on each topic as the main story.

The user interface (UI) of our system is shown in Figure 2. In the right panel of the figure, stories are clustered into topics. Each topic has a main story placed at the top of its cluster with a short summary. For example, "Palm Treo 700p Released for Verizon Wireless" is the recommended main story of the first topic in Figure 2, and other stories on the same topic are listed below the main story as references. Topics are ranked based on a user's personal profile or using one of the default methods.

Besides those main function modules, users can easily add any RSS feed into the system via the "Subscribe" options at the top of the page and manage all subscribed feeds later via the "Channel" options on the left panel of the UI page. A user can also save or tag any story into different folders, such as "Favorite," "Archive," "UnRead," or "ToRead," by choosing the corresponding options listed below each story's title. With these options, users can always preserve the stories they find interesting even though these stories may become outdated later. Our system can learn a user's reading profile more accurately and transparently based on these tagged stories.

4. BACKGROUND

In this section we introduce the background used in this article. In Section 4.1 we introduce the graph of stories. In Section 4.2 we introduce how we extract


content information of stories incrementally, and we introduce our similarity measurement for stories in Section 4.3.

4.1 Connected Subgraph

In the following, we introduce the graph model, which characterizes stories and topics to better understand the relationship between them. Given a story stream coming from a set of blog feeds, and a time window with a fixed size, the collected stories within this window can be represented as a directed graph G(V, E), where V = {v1, v2, ..., vn} stands for vertices and E = {e1, e2, ..., ek} stands for edges. Each vertex vi (i = 1, ..., n) corresponds to a story, and each directed edge ej (j = 1, ..., k) is a hyperlink between two endpoints (stories). Each edge is weighted by the similarity between the stories represented by its two endpoints. G(V, E) is composed of a set of connected subgraphs, which can be represented as G = {G1, G2, ..., Gm}, where Gi ∩ Gj = ∅ for i ≠ j. Gi(Vi, Ei) (i = 1, ..., m) is a connected subgraph, which means that there is at least one path between any pair of its vertices. Here Vi = {vi1, vi2, ..., vi,ni} and Σ_{i=1}^{m} ni = n. We will show that after applying our clustering algorithm, each topic can be represented as a revised connected subgraph Gi′(Vi′, Ei′).

4.2 Incremental TFIDF

To characterize each vertex vi in G(V, E) from the content perspective, we model V based on TFIDF [Salton and McGill 1983], because TFIDF is the most popular technique for document representation and term weighting in information retrieval and Web search. We extract the title and description fields of each story into a "bag of words" model [Baker and McCallum 1998], which is called a document. Terms are stemmed [Kantrowitz et al. 2000] and stop-words are removed [Koller and Sahami 1997]. Then we generate term-frequency vectors using the Vector Space Model (VSM) [Singhal and Salton 1995].
Through TFIDF indexing, each story vi is denoted by a term vector, and the frequency of a term in a document (TF) is weighted by the inverse document frequency (IDF). The document frequency df(m) denotes the number of documents in the collection that contain term m. As we are building an online blog reading system, we use the incremental TFIDF model (shown in Equation (1)) as proposed in Brants and Chen [2003]. There, df_t(m) is the document frequency of term m at time t, and df_Ct(m) is m's document frequency in the newly added set of stories Ct:

    df_t(m) = df_{t-1}(m) + df_{Ct}(m).    (1)

In this model, document frequencies are not static but change when a new set of documents Ct is added to the model. New words in the vocabulary are assigned weights in accordance with their usage in new documents.

4.3 Similarity Calculation

Based on the background knowledge introduced above, each story vi in our graph representation is a vector consisting of normalized incremental TFIDF term weights, which can be used to measure the similarity between different stories. In our current implementation, we use the Cosine similarity, as in Equation (2),
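The incremental update of Equation (1), and the cosine similarity it feeds, can be sketched together as follows. This is a minimal illustration: the whitespace tokenizer and the add-one IDF smoothing are our own assumptions, not choices made in the paper.

```python
import math
from collections import Counter

df = Counter()      # df_t(m): running document frequencies
n_docs = 0          # number of documents seen so far

def add_documents(batch):
    """Equation (1): df_t(m) = df_{t-1}(m) + df_{Ct}(m) for a new batch Ct."""
    global n_docs
    for doc in batch:
        df.update(set(doc.split()))   # each doc counts a term at most once
        n_docs += 1

def tfidf_vector(doc):
    """TF weighted by IDF under the current (incremental) df model."""
    tf = Counter(doc.split())
    return {m: c * math.log((n_docs + 1) / (df[m] + 1)) for m, c in tf.items()}

def cosine(u, v):
    """Equation (2): <u, v> / (||u|| * ||v||)."""
    dot = sum(u[m] * v.get(m, 0.0) for m in u)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

add_documents(["palm treo released verizon", "treo 700p verizon wireless"])
add_documents(["world cup final tonight"])   # a new batch Ct updates df
s = cosine(tfidf_vector("palm treo verizon"), tfidf_vector("world cup"))
```

Stories with no shared terms get similarity 0, while stories sharing terms score positively under the document-frequency model accumulated so far.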


Fig. 3. Window of observation.

to calculate the similarity between any two stories vik and vjl, which belong to connected subgraphs Gi and Gj respectively:

    sim(vik, vjl) = ⟨vik, vjl⟩ / (‖vik‖ · ‖vjl‖)    (i, j = 1, ..., m; k = 1, ..., ni; l = 1, ..., nj),    (2)

where ⟨·,·⟩ represents the inner product and ‖·‖ represents the L2-norm, which is commonly used in vector algebra and vector operations.8

5. STORY CLUSTERING

In this section we present story clustering, which is used in our system to help the user quickly access large quantities of information without duplicate reading of the same event. The posted stories can be treated as a continuous data stream. To deal with such streaming data, we need to design a time-aware mechanism that allows online processing. We employ an appropriate window of observation in this work. There are two things to be taken into account when using an observation window:

— Stories that are too outdated should be removed to keep the system always current for a user, but the potential semantic relationships between stories that are not too old and incoming stories should be captured.
— When new stories come within the window, the time and space complexity of the clustering algorithm should allow real-time online processing.

Considering these two constraints, we introduce a flexible half-bounded window, which has a starting point but dynamically extends with no ending point during the period of observation. For example, when we run the system for the first time in Figure 3, t0 is the starting point of the first window of observation. We collect an initial story set containing stories posted before t0, and first cluster them into topics using a static clustering algorithm (Sections 5.1 and 5.2). Then the length of this window grows as time goes by. Whenever a new story is detected during this process, we call the dynamic clustering algorithm (Section 5.3) immediately to combine it into the initial topics. To prevent the window of observation from growing infinitely and keep the system current,

8 http://mathworld.wolfram.com/L2-Norm.html



Fig. 4. Link problems.

we start a new window at time t1 to remove the outdated stories posted in the past. However, to capture the potential semantic relationships between old stories and incoming stories, such as a series of reports on the situation of the World Cup, we preserve the stories in a ξ adjacent area instead of removing all stories in the last window of observation. In Figure 3, the real starting point of the new window is tl instead of t1. So there are essentially two kinds of stories within this new window of observation: static ones in the adjacent area ξ and dynamic ones coming after t1. Correspondingly, we call two clustering algorithms, based on the static and the dynamic stories respectively, to satisfy the second constraint: real-time online processing.

We next introduce the clustering algorithms. Section 5.1 gives a rough but quick clustering method based on the link structure among stories. Section 5.2 revises the clustering results by adding content information. We also discuss why and how our method ensures high clustering precision. In Section 5.3 we propose an incremental clustering algorithm for the dynamic incoming stories. This solution takes into account the processing time of online processing. We also explain how to discover temporal relationships between stories posted before the adjacent area of the current observation window.

5.1 Link Structure Analysis

As mentioned in the introduction, blogs allow anyone to publish their ideas and can have enormous readerships. If there is a hyperlink between two stories, then using our graph notation the hyperlink can be represented by an edge pointing from vertex v1 to v2. This link can be treated as v1's comments or opinions on v2, made after v1's author read v2's content. So there is a good chance that v1 is talking about the same event as v2.
With this assumption, we cluster stories linked together into one topic, which means that we extract all connected subgraphs from the original story graph G(V, E) and treat each subgraph as one topic. However, in practice there are four phenomena that may hurt the performance if we cluster stories by link structure only. We name them hub link, noisy link, missing story, and missing link.

The first problem is the hub link. A hub link is a hyperlink from a hub node, which contains many such hyperlinks pointing to other stories irrespective of whether these stories are related in content. By extracting connected subgraphs, we may wrongly cluster together stories that talk about different events. In Figure 4(a), stories v3 and v4 actually refer to two different events.

9:12



X. Li et al. Table I. Link-Based Initial Clustering Algorithm

Algorithm: link-based initial clustering
Input: the original story graph G(V, E)
Output: all connected subgraphs {G1, G2, ..., Gm}
1. Assign a weight to each edge in G(V, E) by calculating the similarity between its two endpoints using Equation (2);
2. Remove edges whose weights are below λ; G(V, E) is then revised to G′(V, E′) with no hub link or noisy link between any two vertices;
3. Extract all connected subgraphs {Gi} (i = 1, ..., m) from G′(V, E′) as the initial m topics.
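The algorithm in Table I can be sketched as follows. The union-find used to extract connected subgraphs, and the word-overlap similarity standing in for Equation (2), are implementation conveniences for illustration, not choices prescribed by the paper.

```python
def link_based_clusters(stories, edges, sim, lam):
    """Table I: weight hyperlink edges by similarity, drop weak edges
    (< lam), and return the connected subgraphs of the revised graph."""
    parent = {v: v for v in stories}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path compression
            v = parent[v]
        return v

    for a, b in edges:                       # hyperlink a -> b
        if sim(stories[a], stories[b]) >= lam:   # keep only strong edges
            parent[find(a)] = find(b)        # union: same initial topic

    clusters = {}
    for v in stories:
        clusters.setdefault(find(v), set()).add(v)
    return list(clusters.values())

# Toy run: stories as term sets; word overlap stands in for Equation (2).
stories = {
    1: {"treo", "verizon"}, 2: {"treo", "release"},
    3: {"world", "cup"},    4: {"cup", "final"},
    5: {"hub", "links"},    # hub node linking everywhere
}
overlap = lambda x, y: len(x & y) / max(len(x | y), 1)
edges = [(2, 1), (4, 3), (5, 1), (5, 3)]    # the (5, *) edges are hub links
topics = link_based_clusters(stories, edges, overlap, lam=0.3)
```

With λ = 0.3, the hub node 5 ends up in its own subgraph, illustrating how the similarity threshold suppresses hub links and noisy links before subgraph extraction.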

Because they are connected by hub node v6, they will be wrongly clustered together.

Another problem is the noisy link. A noisy link is any connection between two unrelated stories; it is a more general problem than the hub link. As shown in Figure 4(b), v2 has a hyperlink to v1, but they actually focus on different topics. Both the hub link and noisy link problems decrease clustering performance by clustering unrelated stories together.

The problem of the missing story is shown in Figure 4(c). Story v4 has a link pointing to story v3, and story v3 points to story v1. So v1, v3, and v4 should be clustered into the same topic. However, if we did not get story v3 in our crawled dataset (which frequently happens, as we only provide a user with stories from the blogs he/she subscribed to instead of all blogs in the world), stories v1 and v4 will be incorrectly clustered into two topics even though they talk about the same event.

The last one, the missing link problem, is similar to the missing story. See the example in Figure 4(d). Story v4 actually comments on story v1, or they talk about the same event. However, if there is no explicit hyperlink between them, they can hardly be clustered into the same topic by the link-based approach. Missing story and missing link both decrease the clustering algorithm's performance by failing to cluster related stories together.

As discussed above, if we want to get good clustering results, we must solve all four potential problems by leveraging information beyond link structure.

5.2 Revise Results by Content Information

To deal with the hub link and noisy link problems, and to ensure that any edge in the graph represents a strong semantic correlation between its two vertices, we filter out all edges whose weights are below a predetermined similarity threshold λ from the original graph G(V, E). After that, only edges with comparatively high weights are kept.
We then extract all connected subgraphs from the revised graph G′(V, E′), where E′ is a subset of E. This pruning safeguards our system's story clustering quality. The revised algorithm is summarized in Table I.

We have clustered stories into topics using the algorithm shown in Table I. However, the problems of the missing story and the missing link still need to be resolved. Here we adopt the prevailing approach to new event detection, proposed by Allan et al. [1998] and Yang et al. [1998], in which documents are processed by an online system. In such online systems, when a document is received, the similarities between the incoming document and the known

ACM Transactions on Internet Technology, Vol. 9, No. 3, Article 9, Publication date: July 2009.


Table II. Topic Merging Algorithm

Algorithm. topic merging
Input: all connected subgraphs Gset = {G1, G2, ..., Gm}
Output: the topic set {G1, G2, ..., Gk}
1. for each Gi in Gset do
2.   assign weight wi to Gi by (3)
3. end for
4. rank Gi in Gset by weight wi in descending order
5. for each Gi* in Gset do
6.   if Gi* is visited then i++; else compute Gi*'s similarity with all Gj (j > i*) using the following formula:

       S(Gi*, Gj) = (wi* · wj / (|Gi*| · |Gj|)) Σ_{k=1}^{ni*} Σ_{l=1}^{nj} sim(vi*k, vjl)    (4)

7.   choose the biggest similarity j* = arg max_{j ≠ i*} S(Gi*, Gj)
8.   if S(Gi*, Gj*) > γ then
9.     merge Gi* into Gj* as topic Gi*j*; m = m − 1
10.    adjust the position of Gi*j* to keep the set in descending order
11.  else mark Gi* visited
12.  end if
13.  end if
14. end for
15. return set {G1, G2, ..., Gk}

events (sometimes represented by centroid values) are computed, and then a threshold is applied to determine whether the document is the first story of a new event or a story of some known event. In our situation, after extracting each connected subgraph Gi (i = 1, 2, ..., m) from G′(V, E′), we assign a weight to each subgraph by:

    wi = (max_j{nj} − ni) / max_j{nj},    i, j = 1, 2, ..., m.    (3)

Here ni is the number of vertices in Gi. The value wi measures the scale of subgraph Gi relative to the biggest connected subgraph; it is less than one but greater than or equal to zero, and a smaller wi means a bigger Gi. We rerank topics by wi in descending order. Then, each time, we choose an unvisited topic from the beginning and calculate its similarity with all topics after it. If the largest similarity is no less than a predetermined merging threshold γ, we merge the two corresponding topics. The idea of ranking first is based on our empirical assumption that small topics are the most likely to be merged into other topics; this greedy ordering keeps the cost of similarity calculation low. Table II gives the topic merging algorithm.

5.3 Incremental Clustering on Dynamic Stories

In the previous two sections, we discussed the clustering algorithm for a static dataset. In this section, we propose an incremental clustering solution to deal with the dynamic story stream.
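Before turning to the dynamic case, the static merging step of Table II, with the subgraph weights of formula (3), can be sketched as follows. This is a simplified illustration under our own naming: `sim` stands for the story-level similarity of function (2), γ = 0.7 as in Section 7.1, and weights are recomputed after each merge rather than adjusted in place.

```python
GAMMA = 0.7  # merging threshold gamma (value from Section 7.1)

def topic_weights(topics):
    """Formula (3): w_i = (max_j n_j - n_i) / max_j n_j.
    Small topics get large weights; the biggest topic gets weight 0."""
    m = max(len(g) for g in topics)
    return [(m - len(g)) / m for g in topics]

def merge_topics(topics, sim):
    """Greedy topic merging of Table II.
    topics: list of topics, each a list of story vectors; sim: story similarity."""
    # descending order of w means smallest topics come first
    topics = sorted(topics, key=len)
    i = 0
    while i < len(topics):
        w = topic_weights(topics)

        def S(a, b):  # formula (4)
            s = sum(sim(x, y) for x in topics[a] for y in topics[b])
            return w[a] * w[b] * s / (len(topics[a]) * len(topics[b]))

        others = [j for j in range(len(topics)) if j != i]
        if others:
            j = max(others, key=lambda b: S(i, b))
            if S(i, j) > GAMMA:
                topics[j].extend(topics[i])  # merge G_i into G_j*
                del topics[i]
                topics.sort(key=len)         # keep the ordering after the merge
                continue
        i += 1
    return topics
```

Because the biggest subgraph has weight 0, it never initiates a merge, which matches the intuition that small topics are the ones absorbed.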

Table III. Dynamic Incremental Clustering Algorithm

Algorithm. dynamic incremental clustering
Input: the known topic set Tset = {G1, G2, ..., Gk}, a new story vnew
Output: updated topic set T′set
1. if vnew links to some Gi in Tset then
2.   for each such link L do
3.     check whether L is a hub or noisy link by (2) and get a value simL
4.   end for
5.   if L* has the max value simL* and simL* > λ then
6.     combine vnew into L*'s linking topic Gi*; return
7.   end if
8. end if
9. else detect whether vnew indicates a new event by calculating its similarity with all topics in Tset:

       S(vnew, Gi) = (1/|Gi|) Σ_{k=1}^{ni} sim(vnew, vik)    (5)

10. if max_i S(vnew, Gi) > γ then
11.   combine vnew into Gi
12. else assign vnew as a new topic in Tset
13. end if
14. end if

When a new story vnew comes, instead of rebuilding all topics covered by the old stories combined with vnew, we first check whether vnew has an out-link pointing to any story vi within the current time window. There cannot be any in-link to vnew from an old story vi, i = 1, 2, ..., n, since vnew is posted after vi in the story stream. If vnew does have an out-link, and the link is not a hub or noisy link, we combine vnew into vi's topic. If vnew has more than one such out-link pointing to stories belonging to different topics, we combine vnew into the topic having the strongest semantic correlation with it. If vnew has no such out-link at all, we must detect whether it is a new event by calculating its similarity with all established topics. When the largest similarity is no less than a given threshold, we combine it into the corresponding topic; otherwise we assign it as a new topic. By performing this step iteratively, we can quickly finish clustering the new stories. The new incremental clustering algorithm is shown in Table III.

This algorithm is linear in the number of topics. If the new incoming story has a non-hub out-link pointing to an existing topic, it can be combined immediately, so the algorithm's time cost is appropriate for online processing. It can effectively obtain the relationships between new stories and stories in both the current window and the adjacent area. However, it may miss the temporal relationships with stories posted before the adjacent area. Since our system is a personalized blog reading tool, all of a user's favorite or interesting stories can be recorded in the user's history log. Though his interesting but outdated stories cannot be directly clustered with new ones, by learning his personal preference profile we can still rank the corresponding later-published posts higher, using the technique introduced in Section 6. So our system will not miss any important information or relationship between stories for a given user.
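A minimal sketch of the dynamic step in Table III, with hypothetical names: `out_links` lists (topic index, linked story) pairs for the new story's hyperlinks, and `sim` is the story-level similarity of function (2).

```python
def add_story(topics, out_links, v_new, sim, lam=0.35, gamma=0.7):
    """One step of Table III. topics: list of topics (lists of story vectors);
    out_links: (topic_index, linked_story) pairs for v_new's hyperlinks;
    sim: story-level similarity. Mutates and returns topics."""
    # follow out-links first, ignoring hub/noisy links (similarity below lambda)
    if out_links:
        i, target = max(out_links, key=lambda p: sim(v_new, p[1]))
        if sim(v_new, target) > lam:
            topics[i].append(v_new)
            return topics
    # otherwise new-event detection, formula (5): average similarity between
    # v_new and every story of each known topic
    scores = [sum(sim(v_new, v) for v in g) / len(g) for g in topics]
    if topics and max(scores) > gamma:
        topics[scores.index(max(scores))].append(v_new)
    else:
        topics.append([v_new])  # v_new starts a new topic
    return topics
```

Calling `add_story` once per incoming story yields the iterative behavior described above; the cost per story is linear in the number of known topics.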


Through our user study, we determined that people usually have no interest in outdated stories, so missing the temporal relationships with them will not depress the system's performance.

5.4 Algorithm Summary

We summarize the steps of our story clustering algorithm as follows. When a new window of observation starts, our blog analyzer first calls the static clustering algorithm on stories posted in the adjacent area, and clusters them into topic set G. The static clustering algorithm contains two parts: first clustering stories based on link structure, then revising the result with content information. Later, when the crawler detects that a new story vnew has arrived during the current window, we run the dynamic clustering algorithm to combine the new story into the existing topics G. After that, G is updated to G′: vnew either joins one of the topics already in G, or exists as a new topic in G′. This detection and combination process continues until the time window is long enough (which makes |G′| large enough); then we start a new window of observation and repeat all of the steps.

5.5 Main Story Selection and Topic Ranking

After clustering stories into topics, we need to choose a main entry for each topic from the stories on that topic. The main entry should be the story that best describes the event the topic talks about, so that a user only needs to read this main story most of the time, using the other stories mainly as references. In our system, we use the following formula to select the default main entry:

    vil* = arg max_{l=1,2,...,ni} sim(vil, Ci),    i = 1, 2, ..., m,    (6)

where Ci = (1/|Gi|) Σ_{l=1}^{ni} vil.
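Formula (6) amounts to computing the topic centroid and picking the story nearest to it. A minimal sketch, assuming dense story vectors of equal length and cosine similarity for sim (the function names are ours):

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two dense vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def main_story(topic):
    """Formula (6): return the index of the story closest to the centroid
    C_i = (1/|G_i|) * sum of the topic's story vectors."""
    n, dim = len(topic), len(topic[0])
    centroid = [sum(v[t] for v in topic) / n for t in range(dim)]
    return max(range(n), key=lambda l: cosine(topic[l], centroid))
```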

Ci is the mean vector of topic Gi, and vil* is the selected main story of the ith topic. So the main story is the one closest to the topic's center, and it is considered the most representative story of the topic. For topic ranking, there are two default options: (1) by time, which helps a user follow what is going on in chronological order; and (2) by size of cluster, since an important or popular event has a high chance of being posted or commented on by many bloggers. Under our default ranking schemes, the newer or more popular topics are therefore ranked higher. But our ultimate goal is to present a user with a personalized reading system that ranks all of his/her interesting topics higher. We next proceed to describe our personalized ranking scheme.

6. PERSONALIZED RANKING

To deal with the personalized ranking problem [Gordon et al. 2006; Fan et al. 2005, 2004], we need to learn a user's reading habits first. When a user initially


registers to use our system, he/she subscribes to some blogs in which he/she is interested. We cluster stories from all the user-subscribed blogs and use the default methods to rank topics and select the main story for each topic. We then log all actions a user performs while using our system, in order to refine and obtain a better personalized ranking scheme. The three most interesting things we track are:

— the stories a user clicked on, denoted as S1;
— the stories a user marked as "Read" or "ToRead", denoted as S2;
— the stories labeled into the folder "Archive" or "Favorite", denoted as S3.

S1, S2, and S3 are all story sets. For each user, a personal reading repository containing these three story sets is maintained in our system. To support a large number of users and minimize the storage space for the story repository, we extract only some keywords from the original texts to represent the whole stories. For keyword extraction, we use the method introduced in Chen et al. [2006], which is based on the well-known Latent Semantic Analysis technique [Deerwester et al. 1990]. It is an unsupervised approach for extracting diverse topic phrases from a collection of documents, and it avoids a tough problem of previous works, which often ignore the importance of diversity and thus extract phrases crowded on a few hot topics while failing to cover other less obvious but important ones.

For topic ranking, we aim to rank a user's potentially interesting topics higher. So we first extend keyword extraction from a story to a topic. Besides the keywords extracted from stories in S1, S2, and S3, we also choose keywords from other stories that share the same topics with stories in S1, S2, and S3. We use T1, T2, and T3 to represent the topic sets corresponding to S1, S2, and S3, respectively. So Ti = {τi1, τi2, ..., τi|Ti|} (i = 1, 2, 3), where τij (j = 1, 2, ..., |Ti|) is the keyword set extracted by LSA from the jth topic in Ti. When new topics (G1, G2, ..., Gm) arrive, we find for each Gi the most closely matching topic in T1, T2, or T3, and rank the new topics by their match degrees. A bigger match degree means a higher possibility of the topic matching the user's interest. We assign new topics matched with T3 a higher priority score than topics matched with T2, since for a given user the actions of "Archive" and "Favorite" carry more weight in representing his/her interests than "Read" and "ToRead." Similarly, topics matched with T2 have a higher priority than topics matched with T1. Topics not matched with any set are ranked by the default schemes. The entire personalized topic ranking algorithm is summarized in Table IV.

For main story selection, there is no universal criterion, as different people have different preferences. To capture the most fitting story as the main entry for each topic, we take into account several criteria such as freshness, blog source, and the capability of expressing the whole topic: given a topic clustered from several stories, some users like to read the latest post, some like to read the news from authoritative blog sources, and some, as with our default, prefer the one closest to the topic's center. So we use the following formula to choose the main story for each topic Gi, and rank other


Table IV. Personalized Topic Ranking Algorithm

Algorithm. personalized topic ranking
Input: topic set {G1, G2, ..., Gm}, user's log file
Output: well-ranked topic set {G′1, G′2, ..., G′m}
1. Extract keywords from stories in a user's reading log and extend the keywords to the corresponding topics. Combine them into the user's reading repository to get expanded T1, T2, and T3.
2. Match the new topics G1, G2, ..., Gm with each topic keyword set in T1, T2, and T3 by Equation (7), where S(Gi, τj) is calculated by formula (4), and rate them by the biggest match degree:

       rGi = max of { (1/3) S(Gi, τj),  if τj ∈ T1, j = 1, ..., |T1|
                      (1/2) S(Gi, τj),  if τj ∈ T2, j = 1, ..., |T2|
                      S(Gi, τj),        if τj ∈ T3, j = 1, ..., |T3| }    (i = 1, 2, ..., m)    (7)

3. All matched topics are ranked by the rating value; the others by time or by the size of the topic.

related stories in the same topic:

    vil* = arg max_{l=1,2,...,ni} [ (fs/|T|) × f(vil) + bvil/|T| + sim(vil, Ci) ],    (8)

    where f(vil) = 1 if vil is the freshest story in Gi, and 0 otherwise;    i = 1, 2, ..., m.
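Table IV's rating rule (7) and the personalized main-story score (8) can be sketched together. This is an illustrative reading under our own names: `S` stands for the topic-to-keyword-set similarity of formula (4), and the 1, 1/2, 1/3 weights reflect the stated priority T3 > T2 > T1.

```python
def personalized_rank(new_topics, T1, T2, T3, S):
    """Formula (7): rate each new topic by its best match against the user's
    topic keyword sets; matches with T3 count fully, T2 by 1/2, T1 by 1/3."""
    def rating(G):
        scores = ([S(G, tau) for tau in T3] +
                  [S(G, tau) / 2 for tau in T2] +
                  [S(G, tau) / 3 for tau in T1])
        return max(scores) if scores else 0.0
    return sorted(new_topics, key=rating, reverse=True)

def personal_main_story(fs, T, b, freshest, centroid_sim):
    """Formula (8): score story l by (fs/|T|)*f(v_l) + b_l/|T| + sim(v_l, C_i),
    where f(v_l) = 1 iff v_l is the freshest story of the topic, fs counts
    freshest stories in the log sets S1-S3, b_l counts logged stories from
    the same blog source, and T = |T1| + |T2| + |T3|."""
    def score(l):
        fresh = 1.0 if l == freshest else 0.0
        return (fs / T) * fresh + b[l] / T + centroid_sim[l]
    return max(range(len(b)), key=score)
```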

|T| is the total number of topics in T1, T2, and T3. fs is the number of freshest stories in S1, S2, and S3; a freshest story is the one posted last among all the stories in the same topic. bvil is the number of stories in S1, S2, and S3 whose blog source is the same as that of vil. The last part is the same as in Formula (6). Over time, the more a user uses our system, the more precise his/her interest profile becomes.

7. EXPERIMENTS

7.1 Experimental Setup

To evaluate our proposed system, we gathered blog posts from more than 3000 blog sources. The dataset consists of about 200,000 stories collected over a period of two months (from 4/20/06 to 6/20/06). Each story vi is uniquely identified by a pair ⟨u, b⟩, where u is the URL of the story location and b is the blog that publishes vi. We rank the collected blogs by the number of stories they generate and list the top ten in Table V. The first column is the RSS feed of each blog, and the second column is the number of stories posted by that blog. Our PBR system clusters all posts globally for all users and returns to each user only the ones he/she subscribed to. Compared with clustering per user, the advantages of clustering globally are that (1) all calculations are done on a centralized server, so users need not worry about limited network resources and disk storage; (2) clustering globally improves relevance and accuracy; and (3) it is easy and convenient to implement and supervise a


Table V. Top Ten Blogs in Our Collection Ranked by the Number of Posts in the Second Column

RSS Feeds                                                    Posts
http://Tpc.blogrolling.com/rss500.xml                          500
http://www.frankston.com/public/rss.xml                        486
http://www.nba.com/rss/nba rss.xml                             449
http://www.multilingual-search.com/rss2.php                    434
http://music.msn.com/services/rss.aspx?chartname=topsongs      393
http://blogs.isixsigma.com/RSS2.asp?action=full                357
http://dealarchitect.tvpepad.com/deal architect/index.rdf      313
http://globalauerrillas.typepad.com/Johnrobb/index.rdf         295
http://feeds.feedburner.com/PodcastNYC-Superfeed               275
http://cunninarealist.bloaspot.com/atoin.xml                   273

server-based system. Three groups of experiments are conducted in this section. First, we report the performance of the clustering results. Then we show topic ranking results obtained with different ranking methods. Lastly, we designed a questionnaire asking people general questions about the PBR system. The users chosen for the study were twenty computer science students who were familiar with blogging and were all regular blog readers; we report the user study after they had used PBR for more than four weeks.

Regarding the setup of the evaluations, there are two thresholds in our clustering algorithm: λ measures the weight of the hyperlink between two stories, and γ measures the merging condition of two topics. By comparing results under different threshold values on our collected dataset, we find that setting λ too small increases the risk of combining unrelated stories, while setting it too big increases the risk of failing to combine related stories. Threshold γ behaves similarly. So in all our experiments we empirically set them to 0.35 and 0.7, respectively. For the window of observation, we also found that a bigger time window incurs more computation time for clustering the new incoming stories (see Figure 5) and a lower-quality clustering result (see Figures 6 and 7), due to the fast dynamic clustering method discussed later. Thus we restart a new window once a week, using the flexible half-bounded window introduced in Section 5, and we set the length of the adjacent area to three days.

7.2 Clustering Results

Our incremental clustering algorithm uses link information to extract connected subgraphs and content information to measure the degree of similarity. For this reason, we use a clustering algorithm based only on link information and a clustering algorithm based only on content information as baselines against our incremental algorithm.

The time spent by each algorithm is reported in Figure 5. It shows continuous records for 14 days, during which the system updated once each day. The link-based clustering algorithm is always the fastest; the others are slower since they both need to check the similarity between topics. At the beginning of each half-bounded time window, the time cost of the content-only clustering algorithm is high, because the clustering algorithm groups topics on static


Fig. 5. Runtime during two continuous windows of observation, using link-based, content-based, and our incremental clustering algorithms.

Fig. 6. Purity during a window of observation, using link-based, content-based, and our integrated incremental clustering algorithms.

stories in the adjacent area. Our incremental clustering algorithm is faster than the content-based algorithm, since it first uses link information to obtain a rough clustering result, which reduces the number of topics later revised by content information. During the incremental period, our incremental algorithm is also faster than the content-based algorithm: since new stories have a high chance of discussing topics that already exist, we can quickly combine them into the corresponding topics through any link information they have. The runtime still increases linearly over the time window, however, because the number of topics keeps accumulating, which increases the comparison time between new stories and existing topics.

Then we use two quality measures, purity [Solomonoff et al. 1998] and the Rand Index [Rand 1971], to evaluate the clustering performance of the link-based, content-based, and our final clustering algorithms separately. Assuming that


Fig. 7. Rand Index during a window of observation, using link-based, content-based, and our integrated incremental clustering algorithms.

the clustering algorithm clusters n stories into m clusters, ni is the size of the ith cluster, n^j is the number of stories hand-labeled with the jth class, and ni^j is the number of stories that are in the jth class and in the ith cluster at the same time, purity is defined as:

    purity = (1/n) Σ_{i=1}^{m} max_j ni^j.    (9)
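Formula (9) is straightforward to compute from hand-labeled data. A minimal sketch with hypothetical names (`clusters` as lists of story ids, `labels` as the hand-labeled classes):

```python
def purity(clusters, labels):
    """Formula (9): for each cluster take the count of its dominant
    hand-labeled class; sum the counts and normalize by the total n.
    clusters: list of lists of story ids; labels: {story_id: class}."""
    n = sum(len(c) for c in clusters)
    hit = sum(max(sum(1 for s in c if labels[s] == j)
                  for j in set(labels.values()))
              for c in clusters)
    return hit / n
```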

The Rand Index, which represents the average degree of disagreement between the clustering and the hand-labeled classes, is defined by:

    Rand Index = [ Σ_{i=1}^{m} C(ni, 2) + Σ_{j=1}^{p} C(n^j, 2) − 2 Σ_{i=1}^{m} Σ_{j=1}^{p} C(ni^j, 2) ] / [ Σ_{i=1}^{m} C(ni, 2) + Σ_{j=1}^{p} C(n^j, 2) ],    (10)

where p is the number of manually labeled classes and C(x, 2) = x(x − 1)/2 is the binomial coefficient. Note that the higher the purity, or the lower the Rand Index, the better the clustering performance. Given a window of a week, we manually labeled a subset of twenty topics every day and tested the clustering results on this dataset. The process was repeated over four weeks, and the average results are shown in Figures 6 and 7. Link-based clustering is the fastest; however, its performance is the worst, as shown in Figure 7. In Figure 6, our incremental clustering algorithm seems no better than the content-based clustering method. The reason is that, during a time window, if a new incoming story has a hyperlink pointing to one of the existing topics, we use the threshold λ to measure the combination condition; if there is no such hyperlink, the incremental clustering algorithm uses the threshold γ to measure the merge condition, the same as content-based clustering. When implementing our PBR system, we set λ ≈ 0.5γ, since stories linked together have a good chance of discussing the same topic, as noted in Section 5.1. This lower combining restriction cannot lead


Fig. 8. Runtime during two continuous windows of observation, using static clustering algorithm, dynamic clustering algorithm, and our integrated incremental clustering algorithm.

to a better purity result, as seen in Figure 6, since it increases the chance of combining unrelated stories. However, with this lower combining restriction we greatly improve the Rand Index, shown in Figure 7, because it reduces the possibility of failing to combine stories that discuss the same event. All in all, our incremental clustering algorithm produces an improved and acceptable clustering result as a whole, in addition to saving time.

On the other hand, our incremental clustering algorithm combines static clustering and dynamic clustering within a window of observation. So we next compare the performance when only a static or only a dynamic clustering algorithm implements the total incremental process. The time required by each algorithm is reported in Figure 8. The runtimes are quite different, though the two are essentially the two parts of our final clustering algorithm. If we only use the static clustering algorithm to achieve online clustering, then every time new stories come we have to rebuild all of the topics to combine them with previously posted stories, so the time cost is high. If we use the dynamic clustering algorithm instead, the time requirement is much lower. However, in practice, if we use the dynamic clustering algorithm alone, the center of some topics may drift far away from the topic's real meaning. That is another reason why our final clustering algorithm starts a new window of observation periodically: it runs static clustering at the beginning of each window and the dynamic clustering algorithm afterward. The problem of topic drift may still exist, but it is not as severe as with dynamic clustering alone.

7.3 Topic Ranking

The purpose of our second group of experiments is to evaluate topic ranking results with different ranking methods. We report the top ten topics within the same window of observation (from 6/13/06 to 6/20/06) ranked by the size of the cluster, by time, and by personalized learning results.

Table VI. Top Ten Topics during a Window of Observation (From 6/13/06 to 6/20/06) Ranked by the Size of Cluster

Date   Topic Abstract                                            Size
6/20   Japan announces decision to pull out troops from Iraq       35
6/19   Verizon Sues Vonage for VoIP Patent Infringement            28
6/20   Nokia, Siemens Join in Equipment Venture                    26
6/18   'Cars' Wins Box Office for 2nd Weekend                      26
6/20   Sun Microsoft Upgrades Instant Messenger                    25
6/18   Happy Fathers Day                                           22
6/20   Major Web browsers getting facelifts                        21
6/19   Wireless Handset Manufacturer Stocks                        19
6/19   Microsoft releases new Windows Live IM service              19
6/19   NASA Statement on Decision to Launch Shuttle Discovery      18

Table VII. Top Ten Topics during a Window of Observation (From 6/13/06 to 6/20/06) Ranked by Time

Date   Topic Abstract                                                     Size
6/20   What is the expansion of YAHOO?                                      2
6/20   Credit cards sorted for the summer holidays                          1
6/20   New World Cup worm surfaces                                          2
6/20   U.S. to Buy Anthrax Treatment From HGS                               1
6/20   Wal-Mart to Add Jobs in Struggling Areas                             1
6/20   Polycom releases SoundPoint IP 430 VoIP phone and SIP Software       2
6/20   Sony appoints Oz distrib chief                                       1
6/20   Major Web browsers getting facelifts                                21
6/20   Open Source Mac—Free, Open-Source software for OS X                  1
6/20   U.S. stocks retreat on all indexes                                   2

Table VI shows the top ten topics ranked by size, which are thought to be the most popular topics during this window of observation. The first column is the publication date of the most recent story on each topic, and the last column is the number of stories on that topic. In Table VII, we list the ten latest topics within the same window, ranked by time. There is only a small overlap between the events covered in Tables VI and VII. This is not unexpected, since most of the events listed in both tables are current events. To observe the effect of personalized ranking, we select one particular user's record, and find that keywords such as "stock," "market," and "business" frequently appear in his log file. His personalized ranking result, shown in Table VIII, indeed matches his profile. It can also be seen that the personalized result is dramatically different from the default ranking lists.

7.4 User Study

To evaluate our system, a questionnaire was distributed to twenty users who had used the system for half a month, in order to get their feedback and subjective evaluation. In the questionnaire, we asked people general questions about the PBR system; several interesting items are listed in Table IX. These questions were answered on a scale of 5, with 1 = strongly disagree and 5 = strongly agree.


Table VIII. Top Ten Topics during a Window of Observation (From 6/13/06 to 6/20/06) Ranked by Personalized Learning of a User

Date   Topic Abstract                                                             Size
6/19   Market-Focused R&D Elevates Whirlpool Business Performance To New Level      3
6/19   Riding out the stock market storm                                            4
6/19   Japan stocks fall; dollar up against yen                                     5
6/18   4 Ways to Get Started with Business Blogs and RSS                            4
6/19   Report: NYSE Try to Rival London Stock Exchange                              3
6/19   UK advertising market hit by lack of support for World Cup                   3
6/20   U.S. stocks retreat on all indexes                                           2
6/19   India rent-a-wife business grows as men desperately seek spouses             2
6/20   More of the Same for Microsoft Stock?                                        1
6/19   Toronto stocks seen rising on energy M & A activity                          1

Table IX. Average Rating of Questionnaire Items by Twenty Users (1 = strongly disagree, 5 = strongly agree)

Question                                                                   Rating
Is the integrated blog reading system helpful?                               4.55
Are the design and usability of the system's user interface acceptable?      4.02
Does the service save me a lot of time in reading news from blogs?           4.35
Is it easy to understand and use the tags provided by the system?            3.62
Is the maintaining of favorite stories that the system provides useful?      4.34
Do stories within the same topic all talk about the same event?              3.61
Have stories talking about the same event all been clustered together?       3.01
Is the preview of each topic useful?                                         4.26
Is the default main story of each topic acceptable?                          3.61
Is the personalized ranking helpful?                                         4.05

As can be seen from Table IX, users are generally very satisfied with our system, although some of its features remain to be improved. For example, the question "Have stories talking about the same event all been clustered together?" received an average score of 3.01. This means that some stories which, from the point of view of some users, should be in the same cluster appear separately in our system. Since there is a trade-off between speed and quality, as discussed in Section 7.2, we have to sacrifice some quality to gain high speed for this online system by using our incremental clustering algorithm. Failing to cluster some similar stories together does not make users miss any information; it only adds a little redundant browsing work. The personalized ranking was found to be very useful. For this feature we designed the question "Is the personalized ranking helpful?" Since our PBR system provides the default options of ranking by cluster size or by time, users answered this question in comparison with those defaults. The score is thus a relative value, and the average of 4.05 means the personalized ranking works well and is useful. We also asked users some short questions about when and how they used our system. The results show that, on average, users checked new topics once a day. Almost 80% of users only scanned the top 10-25 topics once they opened the blog reader. If there were many posts on the same topic, almost half of the users chose only one to read most of the time. Finally, we collected some additional comments. The most common positive comments were about the convenience of


getting news and information after clustering, and the power of the post management and maintenance features. People also suggested that the interface be enriched with other valuable information, such as recommendations of other blogs they may be interested in, or comments on each topic by other people.

8. CONCLUSIONS AND FUTURE WORK

In this article, we designed a blog reading system that automatically clusters stories into topics and provides users with personalized ranking lists of topics. We proposed an incremental clustering algorithm that not only captures temporal relationships between stories covered in the current window of observation and stories in the adjacent area, but also satisfies the requirements of online processing. We also proposed a personalized learning infrastructure based on a user's log file. Using the clustering and personalized learning results, we can recommend an optimal entry of each topic for a user, avoid the user's repeated reading of the same content, and rank all of his/her potentially interesting topics higher.

We are continuing to refine the system in several directions. One area for improvement is the performance of the clustering and ranking results: we will collect new features and continue to tune parameters, and we will also try to classify topics at a higher level and extract a good summary for each topic. The other area for improvement is the user interface. The presentation of the current cluster results is a fairly standard list view; we are exploring more functions for better interaction with users, such as adding a feedback table to collect comments and recommending other RSS feeds that may interest the user.

REFERENCES

ADAR, E., ZHANG, L., ADAMIC, L. A., AND LUKOSE, R. M. 2004. Implicit structure and the dynamics of blogspace. In Proceedings of the 13th International World Wide Web Conference Workshop on the Weblogging Ecosystem. 35–39.
ALLAN, J., PAPKA, R., AND LAVRENKO, V. 1998. Online new event detection and tracking. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 37–45.
AVESANI, P., COVA, M., HAYES, C., AND MASSA, P. 2005. Learning contextualised Weblog topics. In Proceedings of the 2nd Annual Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, E. Adar, N. Glance, and M. Hurst, Eds.
BAKER, L. D. AND MCCALLUM, A. K. 1998. Distributional clustering of words for text classification. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 96–103.
BALABANOVICH, M. AND SHOHAM, Y. 1997. Fab: Content-based, collaborative recommendation. Comm. ACM 40, 3, 66–72.
BANSAL, N., BLUM, A., AND CHAWLA, S. 2004. Correlation clustering. Machine Learn. 56, 3, 89–113.
BANSAL, N., CHIANG, F., KOUDAS, N., AND TOMPA, W. F. 2007. Seeking stable clusters in the blogosphere. In Proceedings of the 33rd International Conference on Very Large Databases. 806–817.
BANSAL, N. AND KOUDAS, N. 2007. BLOGSCOPE: A system for online analysis of high volume text streams. In Proceedings of the 33rd International Conference on Very Large Databases. 1410–1413.
BERN, M. AND EPPSTEIN, D. 1996. Approximation algorithms for geometric problems. In Approximation Algorithms for NP-Hard Problems, D. S. Hochbaum, Ed. PWS Publishing Company, Boston, 296–345.

Online Blog Reading System by Topic Clustering and Personalized Ranking



9:25

BONETT, M. 2001. Personalization of Web services: Opportunities and challenges. Ariadne 28. BRANTS, T. AND CHEN, F. R. 2003. A system for new event detection. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information. 330–337. BROOKS, C. H. AND ANDMONTANEZ, N. 2005. An analysis of the effectiveness of tagging in blogs. In AAAI Spring Symposium on Computational Approaches to Analyzing Weblogs, vol. 4737, 1–20. CAYZER, S. 2004. Semantic blogging and decentralized knowledge management. Comm. ACM 47, 12, 47–52. CHEN, J. L., YAN, J., ZHANG B. Y., YANG, Q., AND CHEN, Z. 2006. Diverse topic phrase extraction through latent semantic analysis. In Proceedings of the 6th International Conference on Data Mining. 834–838. DEERWESTER, S., DUMAIS, S. T., FURNAS, G. W., LANDAUER, T. K., AND HARSHMAN, R. 1990. Indexing by latent semantic analysis. J. Amer. Soc. Inform. Sci. 41, 6, 391–407. DELWICHE, A. 2005. Agenda-setting, opinion leadership, and the world of Weblogs. First Monday, 10, 12. FAN, W. G., GORDON, M. D., AND PATHAK, P. 2004. Discovery of context-specific ranking functions for effective information retrieval by Genetic Programming. IEEE Trans. Knowl. Data Eng. 16, 4, 523–527. FAN, W. G., GORDON, M. D., AND PATHAK, P. 2005. Genetic programming-based discovery of ranking functions for effective Web search. J. Manag. Inform. Syst. 21, 4, 37–56. FERRAGINA, P. AND GULLI, A. 2005. A personalized search engine based on Web-snippet hierarchical clustering. In Proceedings of the 14th International Conference on the World Wide Web Special Interest Tracks and Posters. 801–810. GIBSON, D., KLEINBERG, J., AND RAGHAVAN, P. 1998. Inferring Web communities from link topology. In Proceedings of the 9th Conference on Hypertext and Hypermedia. 225–234. GIOTIS, I. AND GURUSWAMI, V. 2006. Correlation clustering with a fixed number of clusters. In Proceedings of the ACM Symposium on Discrete Algorithms, 1167–1176. GORDON, M. D., FAN, W. 
G., AND PATHAK, P. 2006. Adaptive Web search: Evolving a program that finds information. IEEE Intell. Syst. 21, 5, 72–77. HAYES, C., AVESANI, P., AND VEERAMACHANENI, S. 2006a. An analysis of the use of tags in a blog recommender system. ITC-IRST Tech. rep., IJCAI: 2772–2777. HAYES, C., AVESANI, P., AND VEERAMACHANENI, S. 2006b. An analysis of bloggers and topics for a blog recommender system. In Proceedings of the 7th European Conference on Machine Learning and the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD) Workshop on Web Mining. HERRING, S. C., KOUPER, I., PAOLILLO, J. C., SCHEIDT, L. A., TYWORTH, M., WELSCH, P., WRIGHT, E., AND YU, N. 2005. Conversations in the blogosphere: An analysis “from the bottom up.” In Proceedings of the 38th Hawaii International Conference on System Sciences. 1530–1605. JEH, G. AND WIDOM, J. 2003. Scaling personalized Web search. In Proceedings of the 12th International Conference on World Wide Web. 271–279. KANTROWITZ, M., BEHRANG, M., AND MITTAL, V. 2000. Stemming and its effects on TFIDF Ranking. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 357–359. KARGER, D. R. AND QUAN, D. 2005. What would it mean to blog on the semantic Web. Web In Semantics: Science, Services and Agents on the World Wide Web, vol. 3, 147–157. KARYPIS, G. AND KUMAR, V. 1998. Multi-level k-way partitioning scheme for irregular graphs. J. Parall. Distrib. Comput., vol. 48, 96–129. KELLEHER, J. AND BRIDGE, D. 2004. An accurate and scalable collaborative recommender. Artif. Intell. Rev. 21, 3–4, 193–213. KLEINBERG, J. 1999. Authoritative sources in a hyperlinked environment. J. ACM 46, 5, 604–632. KOLLER, D. AND SAHAMI, M. 1997. Hierarchically classifying documents using very few words. In Proceedings of the 14th International Conference on Machine Learning. 170–178. KUMAR, R., NOVAK, P., RAGHAVAN, S., AND TOMKINS, A. 2003. 
On the bursty evolution of blogspace. In Proceedings of the 12th International Conference on World Wide Web. 159–178. LIU, F., YU, C., AND MENG, W. Y. 2004. Personalized Web search for improving retrieval effectiveness. IEEE Trans. Knowl. Data Eng. 16, 1, 28–40. ACM Transactions on Internet Technology, Vol. 9, No. 3, Article 9, Publication date: July 2009.

9:26



X. Li et al.

MANCORIDIS, S., MITCHELL, B., RORRES, C., CHEN, Y., AND GANSNER, E. 1998. Using automatic clustering to produce high-level system organizations of source code. In Proceedings of the 6th International Workshop on Program Comprehension. 45–53. MARLOW, C. 2004. Audience, structure and authority in the Weblog community. In Proceedings of the International Communication Association Conference. PAGE, L., BRIN, S., MOTWANI, R., AND WINOGRAD, T. 1998. The PageRank citation ranking: Bringing order to the Web. Tech. rep. Stanford University. QIU, F. AND CHO, J. 2006. Automatic identification of user interest for personalized search. In Proceedings of the 15th International Conference on World Wide Web. 727–736. QUINTARELLI, E. 2005. Folksonomies: Power to the people. ISKO Italy-UniMIB Meeting. RAND, W. M. 1971. Objective criteria for the evaluation of clustering methods. J. Amer. Statis. Assoc. 66, 336, 846–850. SALTON, G., AND MCGILL, M. J. 1983. An Introduction to Modern Information Retrieval. McGrawHill, Inc., New York. SARWAR, B. M., KARYPIS, G., KONSTAN, J., AND RIEDL, J. 2002. Recommender systems for largescale e-commerce: Scalable neighborhood formation using clustering. In Proceedings of the 5th International Conference on Computer and Information Technology. SINGHAL, A. AND SALTON, G. 1995. Automatic text browsing using vector space model. In Proceedings of the 5th Dual-Use Technologies and Applications Conference. 318–324. SOLOMONOFF, A., MIELKE, A., SCHMIDT, M., AND GISH, H. 1998. Clustering speakers by their voices. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. 757–760. WU, Z. AND LEAHY, R. 1993. An optimal graph theoretic approach to data clustering: Theory and its application to image segmentation. IEEE Trans. Patt. Anal. Mach. Intell. 15, 11, 1101–1113. TSAI, T.-M., SHIH, C.-C., AND CHOU, S.-C. T. 2006. Personalized blog recommendation using the value, semantic, and social model. 
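The incremental clustering step summarized above can be illustrated with a minimal sketch. This is not the system's actual implementation: the bag-of-words representation, the cosine-similarity threshold, and the expiry policy standing in for the half-bounded window of observation are all illustrative assumptions. Each arriving post either joins the most similar active topic or starts a new one, and topics that fall outside the window are retired.

```python
import math
from collections import Counter

SIM_THRESHOLD = 0.3             # assumed value; such a threshold would be tuned
WINDOW_SECONDS = 7 * 24 * 3600  # assumed window: only recent topics stay active

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class IncrementalClusterer:
    def __init__(self):
        # Each topic: centroid term vector, post count, last-seen timestamp.
        self.topics = []

    def add_post(self, text, timestamp):
        # Retire topics that fell out of the observation window.
        self.topics = [t for t in self.topics
                       if timestamp - t["last_seen"] <= WINDOW_SECONDS]
        vec = Counter(text.lower().split())
        # Find the most similar currently active topic.
        best, best_sim = None, 0.0
        for topic in self.topics:
            sim = cosine(vec, topic["centroid"])
            if sim > best_sim:
                best, best_sim = topic, sim
        if best is not None and best_sim >= SIM_THRESHOLD:
            # Merge the post into the existing topic's centroid.
            best["centroid"].update(vec)
            best["n"] += 1
            best["last_seen"] = timestamp
            return self.topics.index(best)
        # Otherwise the post seeds a new topic.
        self.topics.append({"centroid": vec, "n": 1, "last_seen": timestamp})
        return len(self.topics) - 1
```

Because each post is compared only against the centroids of topics still inside the window, the per-post cost stays bounded as the stream grows, which is the property online processing requires.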
Received October 2006; accepted October 2008

ACM Transactions on Internet Technology, Vol. 9, No. 3, Article 9, Publication date: July 2009.
