Adding Personality to Information Clustering

Ah-Hwee Tan and Hong Pan
Kent Ridge Digital Labs, 21 Heng Mui Keng Terrace, Singapore 119613
ahhwee,[email protected]

Abstract. This article presents a new information management method called user-configurable clustering that integrates the flexibility of clustering systems in handling novel data and the ease of use of categorization systems in providing structure. Based on a predictive self-organizing network that performs synchronized clustering of information and preference vectors, we illustrate how a user can influence the clustering of information vectors by encoding his/her preferences as preference vectors. We present algorithms for performing the various cluster personalization functions, including labeling, adding, deleting, merging, and splitting of information clusters. User-configurable clustering has been incorporated into a web-based competitive intelligence system known as Flexible Organizer for Competitive Intelligence (FOCI). We illustrate a sample session of FOCI which shows how a user may create and personalize an information portfolio according to his/her preferences and how the system discovers novel information groupings while organizing familiar information according to user-defined themes.

1 Introduction

Categorization and clustering have been two fundamentally distinct approaches to information organization and content management. Categorization, also known as classification, refers to the task of assigning patterns or objects to one or more predefined classes. Categorization is supervised in nature. It provides good control in the sense that information is organized according to the structure defined by a user. However, due to the predefined structure, categorization is not well suited to handling novel data. In addition, much effort is needed to build a categorization system beforehand. It is necessary either to specify classification knowledge in terms of classification rules/keywords [6] or to construct a categorization system through supervised learning algorithms [8]. The former method requires knowledge specification and the latter assumes the availability of a usually large set of annotated or labeled examples.

Clustering, on the other hand, is unsupervised in nature. For unsupervised systems such as k-means [5], Scatter/Gather [3, 4], and the Self-Organizing Map (SOM) [7], there is no need to train or construct a classifier, as information is organized automatically into groups based on similarities. However, a user has very little control over how the information is grouped together. Although it is possible to fine-tune the parameter values of similarity measures to control the degree of coarseness, the clustering structure is affected globally. In addition, the structure uncovered through the clustering process can be unpredictable. Whereas this is acceptable for a pool of relatively static information, in situations where new information arrives every day, information may be grouped (based on different themes) into different clusters from one day to the next. Such an ever-changing cluster structure can be rather undesirable.

In this paper, we present a novel information management method known as user-configurable clustering that integrates the complementary strengths of clustering and categorization. Using user-configurable clustering, information is first organized through automatic clustering to derive the natural information groupings. A user, upon inspecting the information groupings, can then modify the structure according to his/her requirements and preferences through a suite of cluster manipulation functions. It is an interactive process of clustering, personalization, and discovery through which a user turns an automatically generated cluster structure into his/her preferred organization.

The information clustering engine presented in this paper belongs to a class of predictive self-organizing networks known as Adaptive Resonance Associative Map (ARAM) [9] that learns information groupings or clusters dynamically on-the-fly to encode pairs of information and preference vectors. ARAM is a natural extension of a family of unsupervised learning systems, namely Adaptive Resonance Theory (ART) networks, to incorporate supervisory preference signals. It is chosen over traditional clustering engines, such as k-means and SOM, because online incremental clustering is a critical requirement for supporting the interactive personalization process.

User-configurable clustering has been incorporated into a web-based competitive intelligence system known as Flexible Organizer for Competitive Intelligence (FOCI). FOCI bridges the gap between raw search results and organized competitive information by providing an integrated platform that supports the key activities in a competitive intelligence cycle. We illustrate a sample session with FOCI on how a user may create and personalize an information portfolio and subsequently use it for tracking new information.

The rest of this article is organized as follows. Section 2 provides a summary of the ARAM network, the underlying information clustering engine. Section 3 presents the user-configurable clustering system architecture and the clustering/personalization algorithms. Section 4 illustrates how user-configurable clustering can be used in a personalized content management system for web-based competitive intelligence. The final section concludes and discusses future work.

2 Adaptive Resonance Associative Map

ARAM belongs to a family of predictive self-organizing neural networks known as predictive Adaptive Resonance Theory (ART) networks [1] that perform incremental supervised learning of recognition categories (pattern classes) and multidimensional maps of patterns. An ARAM system (Figure 1) can be visualized as two overlapping Adaptive Resonance Theory (ART) modules consisting of two input fields F1a and F1b with an F2 category field.

Fig. 1. The Adaptive Resonance Associative Map architecture.

For personalized clustering, the F1a field serves as the input field for the information vector A and the F1b field serves as the input field for the preference vector B. The F2 field contains a number of category nodes, each encoding a template information vector and a template preference vector. Given an information vector A with an associated preference vector B, the system first searches for an F2 cluster J encoding a template information vector w^a_J that is closest to the information vector A according to a similarity function. A template matching process then checks whether the template information vector w^a_J and the template preference vector w^b_J of the selected category match the input information vector A and the input preference vector B sufficiently well. If so, ARAM enters a resonance state in which the template vectors of the F2 cluster J are modified to encode the input information and preference vectors. Otherwise, the cluster is reset and the system repeats the search process until a match is found. The ART modules used in ARAM can be ART 1 [1], which categorizes binary patterns, or analog ART modules such as ART 2, ART 2-A, and fuzzy ART [2], which categorize both binary and analog patterns. Fuzzy ARAM [9], which is based on fuzzy ART, is used in our experiments.
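To make the category choice, template matching, and resonance steps concrete, the following Python sketch reconstructs a minimal fuzzy ARAM learner. The fuzzy AND (element-wise minimum) choice and match functions follow the fuzzy ART conventions of [2, 9]; the class layout, parameter defaults, and the handling of an all-zero (null) preference vector are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

class ARAM:
    """Minimal fuzzy ARAM sketch: paired clustering of information (A) and
    preference (B) vectors. An illustrative reconstruction, not the authors' code."""

    def __init__(self, dim_a, dim_b, rho_a=0.5, rho_b=1.0,
                 alpha=0.001, beta=1.0):
        self.rho_a, self.rho_b = rho_a, rho_b   # vigilance parameters
        self.alpha, self.beta = alpha, beta     # choice parameter, learning rate
        self.wa = np.empty((0, dim_a))          # template information vectors
        self.wb = np.empty((0, dim_b))          # template preference vectors

    def _new_cluster(self, A, B):
        self.wa = np.vstack([self.wa, A])
        self.wb = np.vstack([self.wb, B])
        return len(self.wa) - 1

    def learn(self, A, B):
        """Search for a resonating F2 cluster; create one if none matches."""
        if len(self.wa) == 0:
            return self._new_cluster(A, B)
        # Category choice: fuzzy AND (element-wise min) similarity
        Tj = np.minimum(A, self.wa).sum(1) / (self.alpha + self.wa.sum(1))
        for j in np.argsort(-Tj):               # most similar cluster first
            match_a = np.minimum(A, self.wa[j]).sum() / max(A.sum(), 1e-9)
            # Assumption: a null (all-zero) preference vector always matches
            match_b = (np.minimum(B, self.wb[j]).sum() / B.sum()
                       if B.sum() > 0 else 1.0)
            if match_a >= self.rho_a and match_b >= self.rho_b:
                # Resonance: update the information template; the preference
                # template is only updated when a preference is actually given
                # (assumption, so that clustering with null preference vectors
                # does not erase stored labels)
                self.wa[j] = (self.beta * np.minimum(A, self.wa[j])
                              + (1 - self.beta) * self.wa[j])
                if B.sum() > 0:
                    self.wb[j] = (self.beta * np.minimum(B, self.wb[j])
                                  + (1 - self.beta) * self.wb[j])
                return j
            # otherwise the cluster is reset and the search continues
        return self._new_cluster(A, B)
```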

3 User-configurable Clustering

A user-configurable information clustering system (Figure 2) comprises an information clustering engine for clustering information based on similarities, a user interface module for displaying the information groupings and obtaining user preferences, a personalization module for defining, labeling, and modifying the cluster structure, and a knowledge base for storing user-defined cluster structures.

Fig. 2. The personalized information clustering system architecture.

We are concerned with the organization of information in general, which can be about any objects, such as documents, persons, companies, or countries. Information can be encoded as information vectors. Each unit of information I can be represented by an information vector A of attributes or features,

A = (a_1, a_2, ..., a_M),    (1)

where a_i is a real-valued number between zero and one, indicating the degree of presence of attribute i. In the case of documents, for example, the features in the representation vectors could be word tokens commonly known as keywords. The feature sets can be predefined manually or generated automatically from the information set.

For information management, user preferences are represented by preference vectors that indicate the preferred groupings of the information. A preference vector B is defined by

B = (b_1, b_2, ..., b_N),    (2)

where b_i is either zero or one, indicating the presence or absence of the user-defined label L_i.

The personalization module works in conjunction with the information clustering engine to incorporate user preferences and modify the automatically generated cluster structure. Through the user interface module and the personalization module, a user is able to perform a wide range of cluster manipulation functions, including labeling, adding, deleting, merging, and splitting of clusters, through the use of labels or themes. The customized cluster structure, in the form of an ARAM network, can be stored in the cluster structure knowledge base and retrieved at a later stage for processing new information. Based on the personalized cluster structure, new information can be organized according to the user's preferences captured over the previous sessions.
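As a small illustration of the knowledge base, the customized structure can be persisted as the ARAM template vectors together with the label table and reloaded in a later session. The sketch below builds on the illustrative ARAM class sketched in Section 2; the JSON file format and the function names are assumptions, not the FOCI implementation.

```python
import json
import numpy as np

def save_cluster_structure(net, labels, path):
    """Persist the ARAM templates and the user's label table (illustrative)."""
    with open(path, "w") as f:
        json.dump({"wa": net.wa.tolist(),
                   "wb": net.wb.tolist(),
                   "labels": labels}, f)

def load_cluster_structure(net, path):
    """Restore a previously personalized cluster structure into `net`."""
    with open(path) as f:
        state = json.load(f)
    net.wa = np.array(state["wa"])
    net.wb = np.array(state["wb"])
    return state["labels"]
```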

3.1 Clustering

The algorithm for clustering using ARAM is summarized in Table 1. If a predefined cluster structure exists, the system loads the ARAM network before clustering. Otherwise, a new network is created which contains no clusters. During clustering, for each unit of information I, a pair of vectors (A, E) is presented to the system, where A is the information vector of I and E is a null vector such that E_i = 0 for i = 1, ..., N. The process of encoding pattern pairs is as described in Section 2. Clustering is complete when ARAM is stable, in the sense that, for the given set of information, no new cluster is created and changes in the template vectors are below a specific threshold.

With a predefined cluster structure, fuzzy ARAM organizes the information according to the cluster structure. Without a predefined network structure, ARAM reduces to a pure clustering system that self-organizes the information based only on the similarities among the information vectors. The coarseness of the information groupings is controlled by the ARTa vigilance parameter (ρ_a).

Table 1. Algorithm for clustering information.

If a predefined cluster structure exists, load ARAM network N; else initialize ARAM network N.
Loop
  For each information item I,
    1. Derive an information vector A based on I.
    2. Derive a null preference vector E.
    3. Present (A, E) to N for learning.
    4. Record the index J of the cluster encoding I.
until N is stable.
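The following sketch is a direct transcription of Table 1 in terms of the illustrative ARAM class sketched in Section 2. The stability test (no new clusters and only negligible template change between passes) is one plausible reading of the stopping criterion.

```python
import numpy as np

def cluster(net, info_vectors, epsilon=1e-3, max_epochs=50):
    """Cluster information vectors with null preference vectors (Table 1)."""
    E = np.zeros(net.wb.shape[1])             # null preference vector
    assignments = {}
    for _ in range(max_epochs):
        n_before = len(net.wa)
        old_wa = net.wa.copy()
        for i, A in enumerate(info_vectors):
            assignments[i] = net.learn(A, E)  # record cluster index J for item i
        # Stable: no new cluster and template changes below the threshold
        if len(net.wa) == n_before and (
                n_before == 0
                or np.abs(net.wa[:n_before] - old_wa).max() < epsilon):
            break
    return assignments
```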

3.2 Personalization

ARAM can also operate in an insertion mode whereby a pair of information and preference vectors can be inserted directly into an ARAM network. Whereas the learning mode is used for clustering and obtaining the cluster assignments of information vectors, the insertion mode enables a user to influence the clusters created by ARAM by indicating his/her own preferences in the form of preference vectors. During insertion, the vigilance parameters ρ_a and ρ_b are each set to 1 to ensure that user preferences are explicitly encoded in the cluster structure. In most cases, ARAM will create a new F2 cluster node to encode the input information and preference vectors. In the event that the input information vector is identical to the template information vector of an existing cluster and there is a mismatch between the template preference vector and the input preference vector, we force the template preference vector to equal the input preference vector. This is appropriate for the purpose of personalization, which favours preferences given directly by the user. We present the algorithms for performing the various cluster personalization functions below.
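A sketch of the insertion mode on top of the same illustrative ARAM class: vigilance is raised to 1 so that the pair is encoded explicitly, and a preference mismatch on an identical information template is resolved in favour of the user's preference, as described above. The function name and attribute access are assumptions of the sketch.

```python
import numpy as np

def insert(net, A, B):
    """Insert an (information, preference) vector pair with full vigilance."""
    rho_a, rho_b = net.rho_a, net.rho_b
    net.rho_a = net.rho_b = 1.0              # encode user preferences exactly
    try:
        # If A is identical to an existing template but the preferences
        # disagree, overwrite the template preference with the user's.
        for j in range(len(net.wa)):
            if np.array_equal(A, net.wa[j]) and not np.array_equal(B, net.wb[j]):
                net.wb[j] = B.copy()
                return j
        return net.learn(A, B)               # usually creates a new F2 node
    finally:
        net.rho_a, net.rho_b = rho_a, rho_b  # restore clustering vigilance
```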

Labeling Information Clusters. Associating clusters with labels or themes allows a user to "mark" specific information groupings that are of interest, so that the information can be found readily in the future and new information can be organized according to such information groupings. To associate a cluster J with a label L, we simply insert (w^a_J, B) into ARAM, where B is a preference vector representing L. Labels reflect the user's interpretation of the groupings. They serve as useful landmarks for the user in navigating and locating old as well as new information.

Adding Information Clusters. A user can define and insert his/her own clusters into an ARAM network so that the information can be organized according to such information groupings. The inserted clusters reflect the user's preferred way of grouping information and are used as the default slots for organizing information. To insert a new cluster, a pair of information and preference vectors (A, B) is first derived based on the key attributes of the information in the new cluster and the cluster label. After insertion, ARAM re-generates the cluster structure by clustering all the information vectors again. With the addition of user-defined clusters, new clusters may be generated during the re-clustering process.

Deleting Information Clusters. A user can delete a cluster by associating it with a Deleted label. Information in a deleted cluster can then be handled separately and hidden from the user. In addition, a deleted cluster serves as a filter for removing future unwanted information that is similar in content. To delete a cluster J, a pair of template information and preference vectors (w^a_J, D) is inserted into the cluster structure, where w^a_J is the template information vector of the cluster and D is derived based on the Deleted label.

Merging Information Clusters. Merging of clusters allows a user to combine two or more information groupings generated by clustering under a common theme. To merge clusters J_1, ..., J_n, the algorithm first derives a preference vector B encoding the user-specified label L. The vector pairs (w^a_{J1}, B), ..., (w^a_{Jn}, B) are then inserted into ARAM one at a time so that the template preference vectors of the clusters are modified to encode the common theme.

Splitting Information Clusters. Splitting of clusters allows a user to reorganize an information group that he/she deems to contain diverse content into smaller clusters of specific themes. To split a cluster J, a user needs to select a number of information items I_1, ..., I_n from the cluster as the pivots. The algorithm then derives information and preference vector pairs (A_1, B_1), ..., (A_n, B_n), where A_1, ..., A_n are the information vectors of I_1, ..., I_n respectively and B_1, ..., B_n are preference vectors derived from the cluster labels L_1, ..., L_n respectively. After inserting these vector pairs into the ARAM network, each individual information vector originally in the cluster J will be re-organized into one of the smaller clusters depending on its similarity to A_1, ..., A_n.
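All of the manipulation functions above reduce to inserting suitably chosen vector pairs and, where needed, re-clustering. The sketch below strings them together using the illustrative insert and cluster functions defined earlier; the encode_label helper (a one-hot vector over a growing label table, in the spirit of Section 4.2) and the Deleted label name are assumptions made for illustration.

```python
import numpy as np

DELETED = "Deleted"                          # reserved label for deleted clusters

def encode_label(net, labels, name):
    """One-hot preference vector for `name`, growing the label table if needed.
    Assumes len(labels) is kept in sync with net.wb.shape[1]."""
    if name not in labels:
        labels.append(name)
        net.wb = np.hstack([net.wb, np.zeros((len(net.wb), 1))])  # grow N
    B = np.zeros(len(labels))
    B[labels.index(name)] = 1.0
    return B

def label_cluster(net, labels, j, name):
    """Associate cluster j with a theme: insert (w_j^a, B)."""
    insert(net, net.wa[j].copy(), encode_label(net, labels, name))

def add_cluster(net, labels, A, name, info_vectors):
    """Insert a user-defined cluster, then re-cluster all the information."""
    insert(net, A, encode_label(net, labels, name))
    return cluster(net, info_vectors)

def delete_cluster(net, labels, j):
    """Mark cluster j as deleted so that similar future content is filtered out."""
    insert(net, net.wa[j].copy(), encode_label(net, labels, DELETED))

def merge_clusters(net, labels, cluster_ids, name):
    """Re-label several clusters with a common theme."""
    B = encode_label(net, labels, name)
    for j in cluster_ids:
        insert(net, net.wa[j].copy(), B.copy())

def split_cluster(net, labels, pivots, names, info_vectors):
    """Split a cluster using pivot documents A_1..A_n and themes L_1..L_n."""
    for A, name in zip(pivots, names):
        insert(net, A, encode_label(net, labels, name))
    return cluster(net, info_vectors)        # items re-group around the pivots
```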

4 Personalized Information Management

We incorporate user-configurable clustering into a web-based competitive intelligence system known as Flexible Organizer for Competitive Intelligence (FOCI) [10]. FOCI bridges the gap between raw search results and organized competitive information by providing an integrated platform that supports the key activities in a competitive intelligence cycle. FOCI constructs information portfolios by gathering and organizing on-line information into automatically generated folders. A user can then annotate and personalize the portfolios in terms of the content and how the content is organized (i.e. the information structure) according to his/her needs and preferences. The personalized portfolios can be constantly updated by tracking and organizing new information automatically. The portfolios thus function as "living reports" that can be published and shared with other users. In all, the system provides an environment for gathering, organizing, tracking, and publishing competitive information on the web.

In FOCI, the objective of personalizing an information portfolio is twofold. First, organizing the information according to a user-preferred structure facilitates browsing and reporting. In addition, a personalized portfolio can serve as a template for tracking and organizing new information as well as highlighting novel information.

4.1 Encoding Information Vectors

We have adopted a bag-of-words approach for representing text-based documents. To perform real-time content aggregation and clustering, we estimate the content of the pages based on the information provided on the result pages returned by the search engines (instead of loading the original documents). In addition to the keywords contained in the titles and descriptions of the links, we also make use of the URL addresses, which provide meta information about the web pages. For a document d, we derive its information vector A = (a_1, a_2, ..., a_M) such that

a_i = tf(w_i) * r(w_i) * (1 - r(w_i)),    (3)

where the term frequency tf(w_i) is the number of times the keyword w_i appears in document d and the document ratio r(w_i) is computed by

r(w_i) = df(w_i) / N,    (4)

where the document frequency df(w_i) denotes the number of documents in which w_i appears and N is the number of documents in the collection. The above term weighting scheme gives advantage to keywords appearing in approximately half of the documents in the collection, so as to encourage more compressed cluster structures. The information vector is further normalized so that the feature values are between zero and one.
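A minimal sketch of the term weighting of Eqs. (3) and (4): term frequency scaled by r(1 - r), which peaks when a keyword occurs in roughly half of the documents. The whitespace tokenization and the max-scaling used for the final normalization are simplifying assumptions.

```python
from collections import Counter

def encode_documents(docs, vocabulary):
    """Weight a_i = tf(w_i) * r(w_i) * (1 - r(w_i)), then scale into [0, 1]."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each keyword
    df = Counter(w for doc in docs for w in set(doc.lower().split()))
    vectors = []
    for doc in docs:
        tf = Counter(doc.lower().split())
        a = [tf[w] * (df[w] / n_docs) * (1 - df[w] / n_docs) for w in vocabulary]
        top = max(a) or 1.0                  # assumed normalization into [0, 1]
        vectors.append([v / top for v in a])
    return vectors

# Example: hypothetical titles/descriptions returned by a search engine
docs = ["text mining tools", "web text clustering", "data mining software"]
vocab = sorted({w for d in docs for w in d.split()})
print(encode_documents(docs, vocab))
```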

4.2 Encoding Preference Vectors

User preferences, in this context, are in the form of themes or labels assigned by a user to individual documents or clusters. All labels specified by the users are stored in a Label Table. For a label l, we encode a preference vector B = (b_1, b_2, ..., b_N) such that

b_i = 1 if w_i = l, and 0 otherwise,    (5)

where w_i is the ith entry in the label table and N is the number of labels in the table. If a user-specified label cannot be found in the table, the label is added to the table and the dimension of the preference vectors (N) is updated accordingly.
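Eq. (5) in code: a one-hot preference vector over a label table that grows as new labels appear. The LabelTable class name is illustrative.

```python
class LabelTable:
    """Maintains the user's labels and produces one-hot preference vectors."""

    def __init__(self):
        self.labels = []                     # the ith entry is label w_i

    def encode(self, label):
        """Return B = (b_1, ..., b_N) with b_i = 1 iff w_i = label (Eq. 5)."""
        if label not in self.labels:
            self.labels.append(label)        # grow N when a new label appears
        return [1.0 if w == label else 0.0 for w in self.labels]

table = LabelTable()
print(table.encode("Fortune News"))          # [1.0]
print(table.encode("Technology"))            # [0.0, 1.0]
```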

4.3 Clustering

An information portfolio on "text mining" was created by integrating the search results of four internet search engines. For illustration purposes, we constrained the size of the portfolio by selecting only the top 25 hits from each search engine. After removing duplicate links, there were a total of 71 hits. Figure 3 depicts the clustering results based on a combination of URL and content-based keyword features. There are 17 clusters, each characterized by one to three keywords listed in decreasing order of importance. Three clusters, namely fortune, data, and information(1), are the most prominent ones with 20, 17, and 7 documents respectively.

(1) For convenience, we refer to a cluster by the first keyword in its keyword list.

4.4 Personalization

Based on the raw cluster structure generated, this section illustrates how a user may use the various cluster manipulation functions, namely labeling, inserting, merging, and splitting, to personalize his/her portfolio. Figure 4 shows a partially personalized portfolio. The fortune cluster containing news articles from the Fortune news site has been labeled under the theme of Fortune News.

Fig. 3. Clusters created based on the 71 documents collected through four internet search engines.

In addition, a number of user-defined clusters have been created under the theme of Technology. The documents in these user-defined clusters come mainly from the original data, information, and knowledge clusters in Figure 3. In addition, a user-defined cluster with the keyword IBM under Company/Product manages to pull out a link to an IBM Business Intelligence/Text Mining page that was previously buried. With these user-defined clusters, new clusters have also emerged. The most interesting cluster discovered is the websom cluster containing three links related to WEBSOM, a popular text mining technology.

Figure 5 shows an exemplary fully personalized portfolio. A number of split and merge operations have been performed to organize the clusters into five themes. This portfolio uses a combination of organizing schemes. While much of the information is grouped according to its sources and the nature of the content (such as News, Company/Products, Research, and Events), there is a horizontal grouping on Technology that organizes information according to the various subfields and related topics in text mining, such as knowledge management, data mining, and information retrieval.

4.5 Tracking by Incremental Clustering

In this section, we show the benefits of tracking and clustering new information using the personalized portfolio. A new set of 42 documents was collected through three additional search engines. Without a prior structure, the documents would be organized into the clusters shown in Figure 6. In contrast, Figure 7 shows the clustering result when the new documents are organized based on the personalized cluster structure.

Fig. 4. A partially personalized portfolio on text mining.

Fig. 5. A personalized portfolio on text mining. All information has been organized into one of the five themes.

There are 113 documents in the combined portfolio. A significant portion of the new information, especially the documents in the search, software, information, and knowledge clusters (Figure 6), has been organized under the themes of Technology and Company/Product. Some of the other clusters remain, highlighting information that does not fit into the personalized portfolio. The most prominent group is the businesswire cluster, which contains news articles from the BusinessWire news site. This indicates that the system discovers novel information groupings while organizing familiar information into the user's personalized structure.

Fig. 6. Clusters created based on the 42 new documents without personalization.

Fig. 7. The personalized portfolio after integrating the new documents.

5 Conclusions

This paper has presented a new information management method that integrates the complementary strengths of clustering and categorization. The method is more flexible than a pure categorization system, in which information has to be assigned to one or more pre-defined categories or groups. On the other hand, it is more manageable than a pure clustering or self-organizing system, in which users have very little control over how the information is organized. We have also described a direct application of the proposed method to creating and managing personal information portfolios on the web.

Another aspect that we have only briefly explored is the discovery of new information or knowledge. As a user defines his or her know-how and interpretation of the environment in terms of how he/she wants the information to be organized, any information that falls outside of the defined cluster structure is thus new and potentially interesting. This helps a user to identify information that is novel with respect to his/her experience.

References

1. G. A. Carpenter and S. Grossberg. A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics, and Image Processing, 37:54-115, 1987.
2. G. A. Carpenter, S. Grossberg, and D. B. Rosen. Fuzzy ART: Fast stable learning and categorization of analog patterns by an adaptive resonance system. Neural Networks, 4:759-771, 1991.
3. D. Cutting, D. Karger, J. Pedersen, and J. Tukey. Scatter/Gather: A cluster-based approach to browsing large document collections. In Proceedings, 15th ACM SIGIR, 1992.
4. D. Cutting, D. Karger, J. Pedersen, and J. Tukey. Constant interactive-time Scatter/Gather browsing of very large document collections. In Proceedings, 16th ACM SIGIR, 1993.
5. V. Faber. Clustering and the continuous k-means algorithm. Los Alamos Science, 1994.
6. P. J. Hayes, P. M. Andersen, I. B. Nirenburg, and L. M. Schmandt. TCS: A shell for content-based text categorization. In Proceedings, Sixth IEEE Conference on Artificial Intelligence Applications, pages 320-326, 1990.
7. S. Kaski, T. Honkela, K. Lagus, and T. Kohonen. Creating an order in digital libraries with self-organizing maps. In Proceedings, WCNN'96, San Diego, 1996.
8. D. D. Lewis, R. E. Schapire, J. P. Callan, and R. Papka. Training algorithms for linear text classifiers. In Proceedings, 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'96), pages 298-306, 1996.
9. A.-H. Tan. Adaptive Resonance Associative Map. Neural Networks, 8(3):437-446, 1995.
10. A.-H. Tan, H.-L. Ong, H. Pan, J. Ng, and Q.-X. Li. FOCI: A personalized web intelligence system. In Proceedings, IJCAI Workshop on Intelligent Techniques for Web Personalisation, Seattle, pages 14-19, 2001.