Journal of Intelligent & Fuzzy Systems 14 (2003) 13–24 IOS Press
Knowledge discovery in virtual community texts: Clustering virtual communities

A.M. Oudshoff (a), I.E. Bosloper (b), T.B. Klos (c) and L. Spaanenburg (d,*)

(a) KPN Mobile, P.O. Box 30139, 2500 GC The Hague, The Netherlands
(b) ECCOO, Broerstraat 4, 9712 CP Groningen, The Netherlands
(c) CWI, P.O. Box 94079, 1090 GB Amsterdam, The Netherlands
(d) Lund University, Department of Information Technology, P.O. Box 118, 22100 Lund, Sweden

* Corresponding author. E-mail: [email protected].
Abstract. Automatic knowledge discovery from texts (KDT) is proving to be a promising method for businesses today to deal with the overload of textual information. In this paper, we first explore the possibilities for KDT to enhance communication in virtual communities, and then we present a practical case study with real-life Internet data. The problem in the case study is to manage the very successful virtual communities, known as 'clubs', of the largest Dutch Internet Service Provider. Anyone can start a club about any subject, resulting in over 10,000 active clubs today. At the club's foundation, the founder assigns it to a predefined category. This often results in illogical or inconsistent placements, which means that interesting clubs may be hard to locate for potential new members. The ISP is therefore looking for an automated way to categorize clubs in a logical and consistent manner. The method used is the so-called bag-of-words approach, previously applied mostly to scientific texts and structured documents. Each club is described by a vector of word occurrences over all communications within that club. Latent Semantic Indexing (LSI) is applied to reduce the dimensionality of the problem prior to clustering. Clustering is done by the within-groups average linkage method, using a cosine distance measure appropriate for texts. The results show that KDT and the LSI method can successfully be applied to cluster the very volatile and unstructured textual communication on the Internet.

Keywords: Knowledge discovery in text, clustering, virtual community, latent semantic indexing, web portal
1. Introduction

The Internet has grown enormously in size and popularity over just a few years. Such steep growth creates scaling problems of a size unheard of in the digital world. By early 2000, the World-Wide Web already contained 19 terabytes of information in 1 billion documents, and 1.5 million documents were being added daily [4]. Such an abundance of information, stored with only an informal organizational structure and over a very wide area, is bound to create a processing overload. Search engines have been developed as a first line of defense. They promise to find relevant information on the basis of a few keywords. But even when aided by a pre-engineered index database, the operation
needs skill and experience to provide near-acceptable results. Furthermore, the first-generation engines inspected only static links and consequently reached only 300 million documents [1]. Meta-engines and deep crawlers are required to go beyond that point. The search space can already be decreased by organizational measures. In business-to-business applications, a strict and enforced standardization of the directory structure and file naming conventions allows larger information chunks than mere files to be placed at the focal point of attention. Placing the entire product documentation at the disposal of a user community can be accomplished easily and efficiently through an optimized storage agreement. At first sight, the situation is more difficult in consumer applications: here a strict standardization is harder to enforce, and seemingly there is no opportunity for a structured decrease of the search space.
A possible exception is the virtual community, where users create and find their own corner. But again the scaling problem wreaks havoc: when the number of communities grows, the self-styled structure becomes hard to enforce or even maintain.

Not only on the Internet, but also in any organization, the amount of textual information that needs to be processed continues to grow. It has been estimated that more than 80% of the information in an organization is stored in textual form. Given the large increase in the amount of numerical and symbolic data stored in databases, such as transaction records in the telecommunications or credit card industry, we can only imagine the quantity of textual information in organizations today. Just as Knowledge Discovery in Databases technology is now being widely adopted by organizations to process numerical and symbolic information stored in databases, in order to make more informed business decisions, we propose that Knowledge Discovery in Text (KDT) can be used to streamline the large amounts of textual information and put the knowledge contained in these texts to business use.

In this paper we investigate the use of KDT for the self-configuration of a virtual community portal. First we introduce the problem area: what a virtual community is, and why an Internet Service Provider (ISP) hosts communities. Next, we review the current application of Knowledge Discovery in Text, both in general and within virtual communities. After a short discussion of the MIDAS methodology, we show how such a systematic procedure helps to chart the commonalities between the communities, and we illustrate this with the results of a recent study carried out with a Dutch provider. The paper ends with a short overview of the results obtained and directions for future research.
2. Problem domain

An ISP provides services such as Internet access, website hosting, content provisioning through portal and information services, marketplace hosting, etcetera. Most ISPs charge their customers a monthly fee for Internet access, but an increasing number of ISPs provide free Internet access. They make money either on the kickback fee paid by telecommunications companies based on the telephone traffic they generate, or on advertisements placed on their portals. Both sources of income depend on the number of visitors that an ISP manages to attract and retain over a period of time. It is therefore vital for free-access
ISPs to offer enough interesting content on their portals. This content should attract new customers, keep current customers returning as often as possible, and keep them online as long as possible.

2.1. Virtual communities

In [10], a virtual community is defined as an "on-line social network of a group of people with a common interest". Virtual communities have become a popular way to create meeting rooms for similar souls. People with a special interest are attracted to a site where they are guaranteed to find other people with the same interest. This can be a common hobby, a shared fandom or even a common occupation. The needs that a virtual community can serve are information, relationships, relaxation and transactions. As [23] points out, virtual communities are a means to get people to return regularly to the same place on the Internet. This is why ISPs in general, and free-access ISPs in particular, find it useful to host virtual communities.

"Het Net" is the largest Dutch free-access ISP, with over 1 million regular visitors. As stated above, this ISP has a business interest in hosting virtual communities. As starting and maintaining communities can be a very time-consuming task, "Het Net" decided to host a virtual community portal, where each visitor can start his or her own virtual community on virtually any topic. This way, most of the effort is left to the visitors, and it is up to Het Net to provide the infrastructure and services surrounding this virtual community portal. A virtual community on Het Net is known as a club, and there are currently almost 20,000 clubs, of which more than 10,000 are active on a regular basis. A typical virtual community offers a range of rooms and memorabilia. A club on the portal of "Het Net" consists largely of an agenda, a music & movies collection, a chat room, a discussion forum, a photo gallery, a flea market, recent news topics and links to other sites (Fig. 1).

2.2. The club clustering problem

Every visitor of the portal can start a new club on virtually any topic that he or she is interested in. The portal provides a tree-like structure of categories in which the creator can place his or her club. The two main problems of this approach are, first, inconsistency and, second, maintenance. Inconsistency automatically ensues when humans are required to categorize items, as is well known from e.g. library cataloguing studies. In
Fig. 1. Homepage of a virtual community.
the club context this results in inconsistent club placement, where e.g. some of the Britney Spears fan clubs reside in the "music" category while other Britney Spears fan clubs reside in the "fan clubs" directory. The maintenance problem is related to the growth and decline of interest in certain topics over time. Currently, there are so many Britney Spears fan clubs on "Het Net" that they form a category of clubs in itself. However, it is likely that in the future this number of clubs will gradually decrease and eventually only a few will survive. If "Het Net" does not keep up with these changes, the category tree of clubs will become very unbalanced, and it will become harder for potential new members to find the clubs that they are interested in within a reasonable amount of time.

Knowledge Discovery in Texts holds the promise of automating or supporting tasks that involve text documents. The communication within a club can be considered a text document. Examples of texts in a club are chat messages, posts on the forum, the club description, etcetera. As there will be lively communication within the active communities, one may suppose that the collection of these messages contains the knowledge needed to guide the dynamic organization of the clubs on the portal. The purpose of our case study is to investigate the usefulness of KDT to automate or support the regular construction of a balanced club category tree.
A main issue in this problem domain is the nature of the text documents used. Problems in the automatic processing of club communication include a lack of attention to correctness and clarity of spelling and style, emotional coloring rather than objective semantics, and a particular vocabulary. Research has indicated that the vocabulary used in speech covers only about 20% of the full language, and the volatile communication within a community through chat and e-mail comes closer to oral than to written communication. Such observations lead to the question of whether community communication has sufficient content to allow for automatic processing. The problem is therefore to extract, from the content of the textual messages, a structure over the interests, backgrounds and other participant attributes that provides enough information on the topic of each club to allow the construction of a balanced and useful topic category tree. In this context, useful refers to the ability of new visitors to find the clubs they are potentially interested in within a reasonable amount of time; members of a club can always locate their clubs, because the portal provides a personalized club shortlist for members. Success in this application of KDT within virtual communities may pave the way for other useful applications in this volatile and unstructured text domain. In the next section, we discuss a number of potential applications.
3. KDT for virtual communities

We define Knowledge Discovery in Texts (KDT) as: "The process of extracting interesting and non-trivial patterns or knowledge from textual documents". This definition is analogous to that of data mining, obtained by substituting 'large amounts of data' with 'textual documents'. Important in the definition is the fact that KDT is a process, i.e. it is not one magical technique, but constitutes a complete process of collecting information, pre-processing texts, mining the pre-processed information and using the results in an intelligent way. This process will be described in more detail in Section 4.
3.1. Definitions

KDT and KDD (Knowledge Discovery in Databases) are very similar in the mining step, and many techniques familiar from data mining can be used to perform text mining as well. The main difference is in the pre-processing step. Even more than in KDD, the pre-processing step in KDT is crucial to find a representation of the texts that the text mining techniques can use to produce meaningful results. The current KDT literature describes an abundance of possibly useful text pre-processing steps. There are, however, no clear guidelines on when to use which pre-processing step(s). In Section 4.2 we will discuss the pre-processing steps chosen for our case study.

One question that surfaces when studying the available KDT literature concerns the difference between the field of KDT and that of natural language processing (NLP). The goal of NLP has been defined as "to better understand natural language by using computers". Both fields study texts and try to extract information. There are, however, three main differences, as pointed out by [11]:

– KDT uses induction techniques such as classification to produce results;
– KDT results are comprehensible and actionable, whereas NLP may produce statistical tables and principal component analysis results;
– KDT is usually applied to many texts at the same time, whereas NLP is mainly concerned with studying one text at a time.
3.2. KDT applications

In summary, KDT is a process that induces meaningful and actionable information from collections of text documents. Its applications can be roughly divided into four main areas: (a) Information Retrieval, (b) Information Filtering, (c) Information Routing, and (d) Knowledge Extraction.

Textual information such as the contents of a library or a collection of web pages usually needs to be accessible, in order to retrieve information from the collection whenever it is needed. This means that textual information needs to be organized in such a way that it can be navigated and searched. This field of expertise is generally known as information retrieval, and a large part of the available KDT literature is concerned with it. Text mining techniques can, for example, be applied to automatically categorize or label new documents in a collection, which was previously a labor-intensive and rather subjective task.

New information is generated every day, for example newspaper information and electronic mail messages. To reduce the amount of information reaching, for example, the employees of a company, information filtering techniques can be applied. Simple versions of these techniques can be found, for example, in Microsoft Outlook's Inbox Assistant, which can apply rules to delete messages with certain keywords in the subject or coming from a certain source (a toy sketch of such a rule is given at the end of this subsection). More intelligent techniques model the interests of a user and provide only those articles or messages that the user will be interested in.

Text mining techniques can also be applied to assess the content of a message and then route it to the appropriate recipient. These techniques are now starting to be used in e.g. e-mail handling in customer contact centers. Some early experiences are discussed in [19].

The last main area of text mining applications is to extract knowledge about one or more texts, in a more concise form than the texts themselves. Examples include automatic intelligent summarization, trend detection in documents about a certain topic, analysis of co-occurring words in texts, and document clustering where each cluster of documents is described very briefly, which provides a quick overview of even large collections of documents.
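As a toy illustration of the keyword-based filtering rules mentioned above (the messages and the blocklist are invented for the example, and real filters are of course more elaborate):

import re

messages = [
    {"subject": "WIN FREE PRIZES NOW", "sender": "spam@example.com"},
    {"subject": "Club meeting agenda", "sender": "member@example.com"},
]
blocked = {"free", "prizes"}  # hypothetical blocklist

# Keep only messages whose subject shares no word with the blocklist.
kept = [m for m in messages
        if not blocked & set(re.findall(r"\w+", m["subject"].lower()))]
print([m["subject"] for m in kept])  # -> ['Club meeting agenda']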
3.3. Text mining techniques

The result of pre-processing texts is usually a weighted term vector per document. This numeric representation can be used in the actual mining step of the KDT process. Well-known data mining methods such as classification, association and clustering are applied in the text mining process. Also, special text mining techniques such as summarization can be applied.

– Classification: the method of classification is used to label each document with specific predefined categories. Classification is usually a supervised learning process, where the labeling is based on the labeling of documents in a training collection. Techniques that can be used for text classification include, among others, decision trees, neural networks, k-nearest neighbor, rule induction, naïve Bayes and support vector machines. Classification is used, for example, in information retrieval to classify documents based on their content, in order to be able to retrieve them efficiently.

– Clustering: the method of clustering is used to group documents into document clusters. Documents within a cluster are similar to one another, while documents in different clusters are dissimilar. Clustering is usually achieved in an unsupervised training process, where the desired clustering is unknown beforehand. The most difficult part of clustering is to establish a good similarity measure, which is used to compute the effectiveness of a specific clustering in order to select the best one. Clustering is used in information retrieval to order document collections. It can also be used in information filtering: if a document is clustered into the same group as a number of documents which the user is interested in, it is likely to be of interest to this user as well. In much the same way, clustering can be used in information routing. Last but not least, clustering is also used in knowledge extraction: a clustering of a large number of documents, with a short description of each cluster, can provide quick insight into the structure and content of a large document collection.

– Association: the method of association is used to detect patterns in term usage in documents. Through association techniques it can be discovered which items co-occur more frequently than is expected on the basis of their individual occurrences; this indicates a relation between these items. Association is mainly used in knowledge extraction. For example, association can be used to detect changes in term usage over time, by comparing association results of document collections at different points in time.
– Summarization: summarization is a specific text mining technique, used to provide concise and intelligent summaries of long documents. This technique can be used in information retrieval: a user can first read a summary and, if interested, retrieve the entire document. In the Netherlands, this service is already offered for mobile devices by the small Sumatra company. Summarization is also used in knowledge extraction, because a summary provides knowledge without the user needing to read the entire document.

3.4. KDT applications in virtual communities

As described above, KDT may be applied to automate or support the construction of a topic category tree for virtual communities. There are, however, many other possible applications of KDT to enhance the communication in virtual communities. These applications can be divided into the same four categories as in Section 3.2: (a) Information Retrieval, (b) Information Filtering, (c) Information Routing, and (d) Knowledge Extraction.

– Information retrieval: Community information on the Internet should be organized in such a way that (potential) community members can easily find the information they require. An example text mining application is to organize the many different clubs of "Het Net" based on the content of the information exchange within clubs. Another example is the construction of a personalized list of communities that a particular person may be interested in. This list can be constructed based on a comparison of the texts that the person has accessed and the texts in the communities.

– Information filtering: Members of virtual communities have access to large amounts of information. Information filtering techniques can be applied to send them only the information in which they are really interested. This service can even take the form of a "personal community visor": a view of Internet communities based on the topics of interest to a particular user. Using this visor, an Internet user can quickly detect which other communities discuss topics that are related to his or her own interests, and can also assess the amount of information on each topic at these communities.

– Information routing: New information that becomes available on the Internet can automatically be analyzed and used to inform possibly interested users of its existence.
E-mail messages posted in a community forum can automatically be redirected to community members who might be able to respond. These techniques can also be used to automatically remove unwanted content, e.g. messages containing racist or sexist remarks.

– Knowledge extraction: Text mining techniques can be used to meet the information needs of virtual community users, e.g. by automatic intelligent summarization of the content of one or several information items in the community. These techniques can also be used to help community owners by providing insight into the topics discussed, the trends or patterns in topics across different communities, the factors that influence the success and duration of a club, the factors that influence the success of banner advertisements at certain locations, etcetera. Knowledge extraction can also be used to detect a latent interest in a community in a certain topic, based on the topic drift in closely related communities.
4. The case study

The structured discovery of knowledge from data and text needs a phased development process. A number of such processes have been proposed in the past. They share a global division of labor into (a) data collection, (b) data preparation, (c) mining and (d) visualization [3]. As they also share a lack of generality and tend to focus on specific parts of the process, KPN Research has developed for internal use the Mining Data Successfully (MIDAS for short) procedure [8]. It consists of eight successive phases, which will be presented here as a further detailing of the more common four-step procedure.

4.1. Data collection

The data collection starts with the phase of problem definition: what is the actual text mining problem? Clearly, a problem can only be solved once it has been defined. In the classical sense of the physical experiment, this means that the search question must be formulated, the existence of a data set must be established and validated, and finally the nature of the desired answer must be defined. In short, the problem needs to be formulated in terms of text mining, and a solving strategy must be sketched and analyzed that will either give a
desired answer or show irrefutably that the answer is not possible.

In the next step, the required data must be collected and analyzed. It must be determined what kind of data is needed, where this data can be found and how it can be made available to the experiment. Text mining is usually performed on a collection of text documents that can differ in format (HTML, Word or PDF) and be stored on different media (tape, hard disk or compact disc). During the data collection, all relevant documents are retrieved from the various sources and media into a single repository. A single static set to operate on eases all subsequent processing: the documents are at a single location, in a single format, and will not be changed by further activities inside the community. This requirement is sufficient but not necessary. The procedure can also work within a dynamic web environment, but will then require a more elaborate scheme of time stamping and synchronization. Furthermore, the single and central static storage ensures that all data will be available once collected and will not be removed while the process is under way. In other words, it only requires agents for the data collection and no guards during the lifetime of the project [14].

For the project reported here, the documents are based on the archives within the virtual communities of "Het Net". This is a dynamic environment, as these communities are on-line 24 hours a day. Club members access the community database through a browser, subject to some security arrangements. The content of the database changes constantly as the members pass by and communicate through messages and images. For the ease of the experiment, we have refrained from direct access to the database; we rather want a well-behaved environment for the experiment that enables us to draw firm conclusions. In other words, we have simply made a complete dump of the database content on 15 November 2000 and worked only on this copy. Despite the obvious dynamics of community communication, we will thus largely perform a historical analysis, i.e. we will analyze the full collection of text documents on archive up to a specified moment. In this paper, that moment is 15 November 2000, resulting in a collection of 900 Mbytes of information [2]. A coarse overview of the available tables in the database is given in Table 1.
Table 1. Content of the club database

Category               #Records   Short description                                   Usability
Agenda                    11985   Minutes of community life                           Very little
Album                     68583   Portrait gallery                                    Medium
Bargain                    3195   Internal flea market                                Medium but scarce
Community                 15409   Membership administration                           Relevant
Document                  51078   File directory                                      Zero
Forum                    196346   Threaded links to the chat rooms                    High
Photo                    484415   Annotated images                                    Medium
Link to other sites      112556   Private and personal links outside the community    Very little
News                     100725   The news corner                                     Medium to High

All non-textual data such as images and music fragments are not contained in the database but stored on dedicated file servers; we have excluded such files from our experiment. Also, the database tables with scarce or less relevant textual information have been discarded for our experiments. This leaves us with pure text in the categories Forum, Photo and News. The tables for the data storage in these categories are merged to create a single community text warehouse to be operated upon.

Subsequently, we start to identify the variables with relevance to the search question. This assumes a reasonable knowledge of the application area. Such variables or features are not necessarily directly available and must often be deduced from the existing data. Text mining is characterized by its insatiable desire for input features. Because of the large amount of mutual correlation, this does not imply a very large search space, and we will therefore see later that the dimensions of the problem space can be brought down appreciably. Last but not least, we need to select the tools and/or algorithms to be used during the coming text mining process. Often this will not be a monolithic approach, but rather a judicious sequence of tool applications to achieve the desired effect. In the next section we will come back to this issue with a direct focus on ways to achieve the club clustering from text documents.

4.2. Data preparation

In the second phase, the selected and unified data sets must be transformed and filtered to optimally suit the application of the text mining tools. The result of data preparation in text mining problems is usually a weighted term vector per document, suitable for further processing with techniques well known from data mining. The algorithmic side of this problem has been widely researched in the text mining literature. There exist many fine techniques that each solve a limited part of the entire problem. Hence we will see a deliberate sequence of tool applications to create clean data to operate on. In terms of data mining, such steps involve domain transformation (such as converting a time sequence into spectral data), attribute typing, data coding and the division of the data set into a training, a test and a validation part.
In text mining we see the same principles, but in some disguise. We follow Luhn [16], who proposed "that the frequency of word occurrence in an article furnishes a useful measurement of word significance". The majority of the text mining literature uses this same "bag of words" approach, where the result of the pre-processing step is usually one vector per document, describing the frequency of word occurrences in that document. In order to convert documents into such vectors, the following activities can be performed.

Noise removal: this eliminates in succession all kinds of irrelevant data, such as mark-up tags, punctuation marks and spelling errors, cleaning up the input text. In this case study, spelling errors have not been corrected, due to the lack of an easily available algorithm for the Dutch language.

Domain transformation: abbreviations and low-frequency words can be substituted with the normal phrase. Subsequently, terms can be extracted and reduced to their stems. Stemming has not been used here, again due to the lack of a stemmer for the Dutch language. Instead, lemmatization has been used: comparing the terms in the community texts to dictionary terms in Celex [5]. If a word could be mapped to more than one meaning in the dictionary, the most frequent use of the term has been chosen. The terms can also be enriched when a hierarchy of the terms is defined by the user or preloaded with the system. Terms can be added based on, for example, hyponym relations. Hyponym is the linguistic term for the 'is a' relationship: a knife is a weapon, therefore 'knife' is a hyponym of 'weapon'. A rule-based learner could be aided by a feature engineering method that maps words with low information gain to common hyponyms that yield a higher information gain. This feature engineering method could rely on the use of WordNet, a large on-line thesaurus that contains information about synonymy and hyponymy. In the same way, synonyms can be detected and transformed to one term used for further processing. In this case study, no hyponymy or synonymy relations have been used.
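As an illustration of the noise removal and tokenization steps described above, a minimal sketch (not the tooling used in the project) could strip mark-up tags and punctuation and tokenize the remaining text; spelling correction and stemming are omitted, as in the case study:

import re

def clean(text):
    # Remove mark-up tags such as <b>...</b>.
    text = re.sub(r"<[^>]+>", " ", text)
    # Remove punctuation marks, keeping word characters and whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    # Lower-case and tokenize.
    return text.lower().split()

print(clean("<b>Hoi!</b> Wie gaat er zaterdag naar Feyenoord?"))
# -> ['hoi', 'wie', 'gaat', 'er', 'zaterdag', 'naar', 'feyenoord']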
Feature extraction: meaningful words are detected. In part, this is performed by checking lists of words without meaning (stop word removal); for the other part, it is performed by building histograms, following the advice of Luhn [16]. For the virtual communities, we have discarded the 100 most frequent terms in the community texts, and we have discarded terms with fewer than 6 occurrences, resulting in a reduction from 350,000 to 60,000 terms. Usually, terms are weighted to reflect, for example, their use in a headline or abstract, or to better represent the usefulness of the word within a document, for example by normalizing term frequency with document length. In this case study, the tf*idf weighting scheme has been used [9]. This traditional term weighting scheme is used in all kinds of modified forms; the formal definition we used is

$$A_{ij} = tf \cdot idf = \frac{\log t_{ij}}{n_j} \cdot \log \frac{ndocs}{d_i}$$

where $t_{ij}$ is the occurrence count of term $i$ in document $j$, $n_j$ is the length of document $j$, $d_i$ is the number of documents in which term $i$ appears, and $ndocs$ is the total number of documents in the collection.
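A minimal sketch of this weighting scheme, assuming already-tokenized documents (this is an illustration, not the original tooling; note that, read literally, the formula assigns weight zero to terms occurring once in a document, and implementations often use log(t_ij + 1) instead):

import math
from collections import Counter

def tfidf(documents):
    """Compute A_ij = (log t_ij / n_j) * log(ndocs / d_i) per document."""
    ndocs = len(documents)
    # d_i: the number of documents in which term i appears.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        n_j = len(doc)         # length of document j
        tf = Counter(doc)      # t_ij: occurrences of term i in document j
        weights.append({term: (math.log(t) / n_j) * math.log(ndocs / df[term])
                        for term, t in tf.items()})
    return weights

docs = [["club", "soccer", "soccer", "feyenoord"],
        ["club", "music", "britney"]]
print(tfidf(docs))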
Additionally, word and noun phrases can be extracted. Word phrases such as 'New York' only bear meaning when they are recognized as one term: when words are counted and such word phrases are split, they lose their meaning. Extracting noun phrases such as 'artificial intelligence' from a document requires two separate algorithms: first a tagging algorithm to assign part-of-speech tags (noun, verb, preposition, etc.) to the individual words, and second an algorithm to group the tagged words into noun phrases. Word and noun phrase extraction has not been used in this case study.

It is well known in data mining that pre-processing is time consuming: "In particular, the pre-processing phase is crucial to the efficiency of the process, since according to the results in different domain areas and applications, pre-processing can require as much as 80 per cent of the total effort." [18]. In text mining this holds even more, as pre-processing is more difficult and domain dependent than in data mining, for example through the use of different languages in a document collection. Also, the effect of most pre-processing activities on the result of text mining is not unambiguous, and different authors disagree on the usefulness of individual activities. Most authors, for example, use some form of term filtering, but [21] shows that restricting the number of words has only a minor effect on the performance of a text retrieval system.

4.3. Document mining

The result of pre-processing is usually a weighted term vector per document. This numeric representation can be used in the actual mining. To achieve an efficient process, it is of interest to remove the potential correlation between the input features and to create an orthogonal base. This shows in a (sometimes drastic) reduction in dimensionality. Latent Semantic Indexing (LSI) [7] attempts this by replacing the histogram vectors of terms by their eigenvectors. Alternatives are a clustering into semantic categories or even random projection [12]. We used LSI and achieved a reduction from 60,000 to 80 dimensions. Well-known data mining methods such as classification, association and clustering are applied in the text mining process, and special text mining techniques such as summarization can be applied as well. However, we found that LSI imposes a constraint on such methods, as apparently not all mining techniques operate efficiently on the LSI-generated term space [15].
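A sketch of this reduction step using a modern library (scikit-learn, which postdates the original work; the toy corpus below is invented and merely stands in for the real club texts):

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# One pseudo-document per club; stand-ins for the real club texts.
texts = ["club %d talks about topic %d" % (i, i % 50) for i in range(500)]

tfidf = TfidfVectorizer().fit_transform(texts)  # sparse document-term matrix
lsi = TruncatedSVD(n_components=80)             # cf. 60,000 -> 80 dimensions
doc_vectors = lsi.fit_transform(tfidf)          # dense LSI document vectors
print(doc_vectors.shape)                        # (500, 80)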
Clustering, introduced in Section 3.3, groups documents such that documents within a cluster are similar to one another while documents in different clusters are dissimilar; here it is used to organize the collection of clubs. There are several algorithms to perform clustering. Clustering algorithms can be classified according to two aspects [17]: the generated structure, which can be hierarchical, flat or overlapping, and the technique used to build that structure, which can be either partitional or agglomerative. Partitional methods divide the set into several clusters at once and then shuffle examples between clusters in order to increase within-cluster similarity and between-cluster dissimilarity. Agglomerative clustering methods gradually add examples to clusters until the final clustering is achieved. Agglomerative cluster methods can be seed-based, which means that they randomly pick a number of examples from the set as the cluster centres and gradually add the remaining examples to these clusters. Other, non-seed-based methods regard every example as a cluster of its own and continuously fuse the most similar clusters. In this case study, we have used a hierarchical agglomerative clustering (HAC) method. Advantages of this method are its simplicity, speed and global optimisation [22]. The HAC algorithm also fits very well with LSI vectors [20], because it organizes documents in a tree-like structure.

At the core of any clustering algorithm lies a means to quantify the attraction between features. From the vector representation of the input features follows the notion of distance, measured in a non-linear n-dimensional space. There is more than one way to measure such a distance: either Euclidean (can be measured with a 'ruler') or based on similarity. For example, in terms of road distance (a Euclidean distance) York is closer to Manchester than it is to Canterbury; however, if distance is measured in terms of the characteristics of a city, York is closer to Canterbury. Euclidean metrics measure true straight-line distances in Euclidean space. Non-Euclidean metrics apply to distances that are not straight-line, but which obey certain rules. The Manhattan or City Block metric is an example of this type. Semi-metrics obey the first three metric rules (non-negativity, identity and symmetry) but may not obey the 'triangle' rule. The cosine measure is an example of this type. It is a pattern similarity measure: the cosine of the angle between two vectors is identical to their correlation coefficient,

$$\mathrm{Similarity}(x, y) = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}}$$

The problem space of LSI term vectors is inherently both non-linear and multi-dimensional. Consequently, the clustering must be performed with a semi-metric distance measure such as the cosine distance, as experimentally confirmed in this project independently of [13]. In this case study, the cosine distance has been applied, as it is the most appropriate measure for texts.

A further choice in clustering with the HAC algorithm concerns the cost function, i.e. the distance between clusters, which determines which two clusters are most similar and should be joined in the next step. Clustering methods create clusters either sequentially or in parallel, but this is purely an algorithmic difference; actual clustering tools also provide a number of cost functions that the algorithm uses to interpret the measured distances. Altogether this gives the miner a lot of freedom and therefore a choice problem. The most popular implementation of the HAC algorithm uses the nearest neighbour cost function: it joins the most similar pair of objects that are not yet in the same cluster, where the distance between two clusters is the distance between the closest pair of points, one in each of the two clusters. The types of clusters that HAC with this cost function was designed for are characterized as (a) long straggly clusters, (b) chains or (c) ellipsoidal clusters. From its thoroughly understood and well-developed theoretical basis comes a very efficient implementation: no foreknowledge in terms of a cluster centroid or representative is required, and there is no need to re-compute the similarity matrix during the clustering. However, it has difficulty handling poorly separated or intertwined clusters. In line with the conclusions in [24], we initially used this computationally efficient nearest neighbour measure, but its application to the case study was only moderately successful: it constructs a single very large cluster amid many small ones, organizing the communities in an unbalanced tree that hampers ease of access over the portal (Fig. 1). It was therefore concluded that computationally less simple cost functions should be evaluated as well. We have pre-selected the following cost principles: (a) average linkage between groups, (b) average linkage within groups, (c) centroid, (d) median, (e) k-nearest neighbour and (f) Ward [13].
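A minimal sketch of this clustering step with standard tooling (SciPy, not the package used in the project; the vectors are random stand-ins for the LSI club vectors, and SciPy's 'average' method corresponds to average linkage between groups, so the within-groups variant would need a custom criterion):

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
lsi_vectors = rng.random((200, 80))        # stand-ins for LSI club vectors

dist = pdist(lsi_vectors, metric="cosine")           # semi-metric cosine distance
tree = linkage(dist, method="average")               # agglomerative merge tree
labels = fcluster(tree, t=20, criterion="maxclust")  # cut into 20 clusters
print(np.bincount(labels)[1:])                       # cluster sizes, cf. Fig. 2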
Fig. 2. Histogram of cluster size after HAC clustering with nearest neighbours (number of clubs per cluster, logarithmic scale, for 20 clusters).

Fig. 3. Comparison of cluster methods: cluster size per cluster number (logarithmic scale) for average linkage (between groups), average linkage (within groups), centroid, median and nearest neighbour.
4.4. Evaluation and use

A comparison of clustering mechanisms has been performed on a first division into 20 clusters. This confirms the earlier observation that HAC tends to result in extremely unbalanced categorizations (Fig. 2). The "average linkage within groups" method results in the only clustering with clusters of 100 to 1000 clubs; all other methods create one cluster with almost 10,000 clubs and 19 clusters of on average 10 clubs. This is not a desirable feature of a topic category tree in a portal hosting almost 20,000 clubs. Figure 3 shows clearly that the within-groups average linkage method is best suited for our purposes, as it provides the most balanced categorization. The clusters are still large, but this is not a problem, because each cluster can be subdivided in a second run.

After selecting the average linkage within groups clustering algorithm, the quality of the resulting categorization has been evaluated.
The clubs were first clustered into eight groups or main topic categories. After that, the clubs in each main topic category were again clustered into eight groups or minor topic categories (a sketch of this two-level procedure is given below). The resulting clusters or categories were labelled manually by inspecting the content, i.e. by inspecting the names of the clubs. Figure 4 shows a schematic overview of a part of the result. All clubs in a cluster are covered by their respective label, with only a few exceptions. For instance, the 'Dutch soccer players' cluster also contains a club called 'Girls!'. Like most of the anomalies, this one can be explained by inspecting the content of the club: it seems that these girls like to talk about soccer players. The necessity of preselecting the number of desired clusters is problematic. Due to the still somewhat unbalanced clustering results, the clusters differ in size. The large clusters cover a much broader topic than the small ones, which is not desirable.
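The two-level categorization described above can be sketched as follows, reusing the same illustrative SciPy-based HAC routine (the vectors are again random stand-ins, and sub-clustering is skipped for main categories with too few clubs):

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def hac(vectors, k):
    # Hierarchical agglomerative clustering, cosine distance, cut at k clusters.
    return fcluster(linkage(pdist(vectors, "cosine"), "average"),
                    t=k, criterion="maxclust")

rng = np.random.default_rng(1)
club_vectors = rng.random((500, 80))         # stand-ins for LSI club vectors

main = hac(club_vectors, 8)                  # eight main topic categories
minor = {m: hac(club_vectors[main == m], 8)  # eight minor topics per main one
         for m in np.unique(main) if (main == m).sum() >= 8}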
Fig. 4. Topic category tree after clustering. The root branches into main categories such as Music, Computers & amusement, Sports and Associations/TV amusement/animals, with minor categories such as Dutch soccer players, International soccer, Dutch major league, Feyenoord, trains, athletics/tennis/swimming/bicycling (Tour de France), and Dragon Ball & Pokemon.
For instance, 'Associations, TV Amusement, Animals' contains over 2000 clubs and therefore covers a broad topic. The small clusters are more specifically targeted at a small topic; e.g., the 'Pokemon & Dragon Ball' cluster contains only 250 clubs. In this case, a cluster 'Amusement' at the top level would be more appropriate, and the 'Pokemon & Dragon Ball' cluster would fit into it perfectly. A more flexible clustering algorithm, which determines the most appropriate number of clusters itself, could solve this problem. Also, some topics are represented in different clusters. For instance, 'chat & soaps' contains clubs about television soaps, but the 'Associations, TV Amusement, Animals' section also contains clubs about (other) soaps. This makes a directed search difficult. One can distinguish user actions into directed search on one side of the spectrum and browsing on the other [6]; this clustering would be of more use in a browsing environment.

The above evaluation shows that automatic clustering of clubs on the "Het Net" portal is indeed feasible. The first results render a topic category tree that is relatively balanced and logical. This automatic clustering can be the basis for a final topic category tree. The final step would involve moving the few illogically placed clubs, such as the 'Girls!' club, and rebalancing the very large or very small topics or subtopics. This would require relatively little work compared to the huge task of manually sifting through 10,000 clubs and determining in which categories to place them. Due to the general nature of our approach, we are confident that the same result can be achieved for other virtual communities or other textual web content.

An open question in clustering, and in this case study, is the reproducibility of the results.
There is no guarantee that a clustering of the clubs at a later time will yield roughly the same categories, and it is not desirable to completely change the topic category tree every month or so. Therefore, further research is needed to determine how previous clustering results can be taken into account to ensure some stability in the topic tree.
5. Conclusions

This paper provides a step-by-step account of the application of KDT to the categorization of virtual communities, by mining the large variety of volatile and unstructured texts communicated in the clubs of "Het Net". The purpose is to give a proof of concept for the maturity of KDT technology in such a turbulent environment, demonstrating the feasibility of automatic maintenance of the communities hosted by an Internet Service Provider. As a first attempt, the experiment is reasonably successful. Text mining provides a drastic reduction of the feature space and clears the way for a restructuring of the access to the 10,000 clubs. The weak point for all-out automation remains the clustering itself; here, known improvements for the creation of a balanced structure tree still need to be introduced. At the moment, the detailed analysis of the required steps and their conscientious integration into a KDT suite of algorithms eases further experimentation. This supports fast-turnaround experimentation with alternative tooling and also supports future in-line trend monitoring. The fact that, even for the modest case study outlined here, the resulting categorization was intuitively almost correct lends the approach sufficient credibility.
Acknowledgements

This work was performed while Oudshoff, Bosloper and Klos were at KPN Research Laboratories in Groningen (The Netherlands), and Spaanenburg was on sabbatical leave from Rijksuniversiteit Groningen. A short version of this paper was presented at the BNAIC'01 conference in Amsterdam (The Netherlands).
References

[1] K.D. Bollacker, S. Lawrence and C.L. Giles, CiteSeer: an autonomous web agent for automatic retrieval and identification of interesting publications, Nature 400 (1998), 107–109.
[2] I.E. Bosloper, Categorizing communities on the Net, MSc. Thesis, Groningen University, The Netherlands, 2001.
[3] R. Brachman and T. Anand, The process of knowledge discovery in databases: a human-centered approach, Advances in Knowledge Discovery and Data Mining (1996), 37–57.
[4] Bright Planet, The deep Web: surfacing hidden value, white paper (http://www.brightplanet.com/deepcontent/tutorials/DeepWeb/index.asp), 2000.
[5] CELEX, The Dutch Centre for Lexical Information, http://www.kun.nl/celex/.
[6] D.R. Cutting et al., A Cluster-based Approach to Browsing Large Document Collections, Proceedings 15th International SIGIR, 1992, pp. 318–329.
[7] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer and R. Harshman, Indexing by Latent Semantic Analysis, Journal of the American Society for Information Science 41(6) (1990), 391–407.
[8] M.P. Dieben, S.H. Kloosterman and A.M. Oudshoff, MIDAS of hoe een database in een goudmijn verandert (in Dutch), Internal report (KPN Research, Groningen), 1996.
[9] S.T. Dumais, Improving the Retrieval of Information from External Sources, Behavior Research Methods, Instruments and Computers 23(2) (1991), 229–236.
[10] J. Hagel and A.G. Armstrong, Net Gain, Harvard Business School Press, 1997.
[11] Y. Kodratoff, Knowledge discovery in texts: a definition, and applications, Foundations of Intelligent Systems, 11th International Symposium, ISMIS'99, Proceedings, 1999, pp. 16–29.
[12] T. Kohonen et al., Self organization of a massive document collection, IEEE Transactions on Neural Networks 11(3) (2000), 574–585.
[13] G.N. Lance and W.T. Williams, A general theory of classificatory sorting strategies, Computer Journal 9(4) (1967), 373–380.
[14] D. Landau et al., TextVis: an integrated visual environment for text mining, Proceedings Second European Symposium on Principles of Data Mining and Knowledge Discovery PKDD'98, Nantes, France, 1998, pp. 56–64.
[15] T. Letsche and M. Berry, Large-scale information retrieval with latent semantic indexing, Information Sciences 100 (1997), 105–137.
[16] H.P. Luhn, The automatic creation of literature abstracts, IBM Journal of Research and Development 2 (1958), 159–165.
[17] Y.S. Maarek, R. Fagin, I.Z. Ben-Shaul and D. Pelleg, Ephemeral document clustering for web applications, IBM Research Report RJ 10186, 2000.
[18] H. Mannila, Data mining: machine learning, statistics, and databases, Proceedings of the 8th International Conference on Scientific and Statistical Database Management, Stockholm, Sweden, 1996.
[19] M.A.H. Offenberg, ICT impact and organizational change, MSc. Thesis, Groningen University, The Netherlands, 1998.
[20] E. Rasmussen, Clustering Algorithms, ch. 6, in: Information Retrieval: Data Structures and Algorithms, W.B. Frakes and R. Baeza-Yates, eds, New Jersey: Prentice Hall, 1992.
[21] G. Salton and C. Buckley, Term weighting approaches in automatic text retrieval, Technical Report 87-881, Cornell University, Department of Computer Science, 1987.
[22] R. Sibson, SLINK: an optimally efficient algorithm for the single link cluster method, Computer Journal 16(1) (1973), 30–34.
[23] L. Wladimiroff, T.W. Geurts and S. Thie, Virtual communities: Service aan en interactie met klanten (in Dutch), Internal Report (KPN Research, Groningen), 1998.
[24] O. Zamir and O. Etzioni, Web Document Clustering: A Feasibility Demonstration, Proceedings SIGIR Conference on Research and Development in Information Retrieval, 1998, pp. 46–54.