Boosting Bookmark Category Web Page Classification Accuracy using Multiple Clustering Approaches
Chris Staff
Department of Artificial Intelligence, University of Malta
Email: [email protected]
Abstract
Web browser bookmark files are used to store links to web sites that the user would like to revisit. However, bookmark files tend to be under-utilised, as time and effort are needed to keep them organised. We use four methods to index and automatically classify documents referred to in 80 bookmark files, based on title-only and full-text document indexing and two clustering approaches. We evaluate the approaches by selecting a bookmark entry to classify from a bookmark file and re-creating a snapshot of the bookmark file that contains only entries created before the selected bookmark entry. Individually, the algorithms have an accuracy at rank 1 similar to that of the title-only baseline approach, but the different approaches tend to make different recommendations. We improve accuracy by combining the rank 1 recommendations of each algorithm. The baseline algorithm is 39% accurate at rank 1 when the target category contains 7 entries. By merging the recommendations of the 4 approaches, we reach 78.7% accuracy on average, recommending a maximum of 3 categories. 30.6% of the time we need make only one recommendation, which is correct 81.4% of the time.
1. Motivation
Web browsing software, such as Safari, Internet Explorer, and Mozilla Firefox, includes a ‘bookmark’ or ‘favorites’ facility, so that a user can keep an electronic record of web sites or web pages that they have visited and are likely to want to revisit. It is usually possible to manually organise these bookmark entries into related collections called ‘folders’ or categories. If bookmark files are kept organised and up-to-date, they could be a good indication of a user’s long-term and short-term interests, which could be used to automatically identify and retrieve related information. However, bookmark files require user effort to keep organised, so a collection of bookmarks tends to become disorganised over time [1], [2].
We describe HyperBK2, which can assist users in keeping a collection of bookmarks organised by recommending the category in which to store the entry for a web page that is being bookmarked. We examine several approaches to indexing web pages, deriving category representations, and classifying web pages into categories. Ideally, a user would be recommended a single category which would always be the correct one (i.e., the user would never opt to save the entry to some other category). The approaches that we have compared do not meet this ideal, but we can offer the user a selection of categories that may include the correct one. Of course, we can do this trivially by offering the user all the categories that exist (just as in information retrieval we can guarantee a recall of 100% by retrieving all documents in the collection), so we want to show the user as small a selection of recommendations as possible, while maximising the chances that this selection contains the correct target category. We experimented with two fusion strategies. Taking the top-5 recommendations from two different approaches and fusing them resulted in an average of 7 recommendations in a result set and an accuracy of 80%. Fusing the results of the four different approaches at rank 1 gives comparable accuracy, but we need offer the user a maximum of only 3 recommendations.
In section 2 we discuss similar systems. HyperBK2’s indexing and classification approach is discussed in section 3, the evaluation approach in section 4, and the results are presented and discussed in section 5. Section 6 outlines our future work and conclusions.
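The rank 1 fusion described above can be illustrated with a minimal sketch. It assumes that each approach returns a ranked list of category names and that the fused set is simply the distinct rank 1 categories. The function name `fuse_rank1`, the example category names, and the ordering of the fused set by number of agreeing approaches are assumptions made for illustration, not details taken from HyperBK2.

```python
from collections import Counter
from typing import Dict, List


def fuse_rank1(recommendations: Dict[str, List[str]]) -> List[str]:
    """Merge the rank 1 category of each approach into one short list.

    `recommendations` maps an approach name to its ranked category list.
    Duplicates are collapsed; ordering by vote count (ties broken by the
    order in which approaches were supplied) is an illustrative assumption.
    """
    votes: Counter = Counter()
    first_seen: Dict[str, int] = {}
    for position, ranked in enumerate(recommendations.values()):
        if not ranked:
            continue  # this approach made no recommendation
        top = ranked[0]
        votes[top] += 1
        first_seen.setdefault(top, position)
    return sorted(votes, key=lambda cat: (-votes[cat], first_seen[cat]))


# Hypothetical rankings for one incoming web page.
example = {
    "TITLE-ONLY": ["Recipes", "Travel"],
    "FULL-TEXT": ["Travel", "Recipes"],
    "CLUSTER": ["Travel"],
    "SINGLETON": ["News"],
}
fused = fuse_rank1(example)
```

When all approaches agree, the fused list contains a single category, which corresponds to the case in which only one recommendation needs to be shown to the user.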
2. Background and Similar Systems
Web pages are frequently classified to automatically create web directories [3], [4], [5], [6] with predefined categories [7] or dynamic or ad hoc categories [6], [3], or to assist users in bookmarking favourite web pages [8]. Bookmarking is a popular way to store and organise information for later use [9]. However, drawbacks exist, especially when the number of bookmarks increases over time [10]. Bookmark managers support users in creating and maintaining reusable bookmark lists. These may store information locally, such as HyperBK [11], Conceptual Navigator (footnote 1) and Check&Get (footnote 2), or centrally, such as CariBo [8], Delicious (footnote 3) and BookmarkTracker (footnote 4). Web pages may be classified using contextual information, such as neighbourhood information and link graphs [4], [5], or using supervised [12] or partially supervised learning techniques [6]. The approach in [7] summarises a web page before classifying it, to eliminate noise in the form of navigational links and adverts.
Delicious, which is an online service, allows users to share bookmarks. Categorisation is aided by the use of tags, which users associate with their bookmarks. However, there are no explicit category recommendations when a new bookmark is being stored. InLinx [7] provides for both recommendations and classification of bookmarks into “globally predefined categories”. Classification is based on the user’s profile and the web-page content. CariBo [8] classifies a bookmark by first establishing a similarity in the interests of two users and then finding a mapping between the folder location of a bookmark in the collaborators’ bookmark files and that of the target user’s bookmark hierarchy.
In previous work [11], we built a bookmark management system, HyperBK, that can recommend a destination bookmark category (folder). However, only a small number of bookmark files had been used in the evaluation.
In HyperBK2, we have modified our approach to indexing and classification, and we have evaluated the new approach using 80 bookmark files.
Footnotes:
1. http://www.abscindere.com/
2. http://activeurls.com
3. http://del.icio.us
4. http://www.bookmarktracker.com
3. HyperBK2’s Indexing and Classification Approach
The literature suggests that approaches to web page classification frequently use a global classification taxonomy [7] or make use of
a web page’s neighbourhood information [4], [5]. We want to take a partially supervised approach to clustering [6]: the only sources of information are the web page to be bookmarked and the web page entries in the user’s existing bookmark categories (positive examples). We avoid using a global classification taxonomy, instead using the categories that a user has created in his or her own bookmark file. This allows our recommendations to be personalised, and bookmark entries will be grouped according to an individual user’s needs and preferences.
We examine four approaches to indexing bookmark files and classifying web pages into an existing bookmark category. We use one, called TITLE-ONLY, as a baseline. As the name suggests, only the document title is indexed, using TFIDF [13]. The indexed titles are combined to build a centroid representation of a category. An incoming document is classified according to the similarity of its title to each of the category centroids.
The other three approaches, FULL-TEXT, CLUSTER, and SINGLETON, all build their indices and classify web pages based on a document’s full text. They vary according to which bookmarked web pages (bookmark entries) in a category are used to derive the category centroids that are compared to the incoming web page. TITLE-ONLY and FULL-TEXT build one centroid per category, using all the entries in the category. SINGLETON treats each entry as a cluster centroid, so a category containing n entries will have n centroids. CLUSTER clusters the n entries in a category using a thresholded similarity measure, deriving m centroids (where 1