Generating Dynamic and Adaptive Knowledge Models for Web-Based Resources
Lei Li
Computer Information Systems Department, Georgia State University
Email: [email protected]

Vijay K. Vaishnavi
Computer Information Systems Department, Georgia State University
Email: [email protected]

Art Vandenberg
Information Systems & Technology, Georgia State University
Email: [email protected]

Seema Metikurke
Computer Science Department, Georgia State University
Email: [email protected]

Abstract: The number of web pages on the World Wide Web continues to grow exponentially as more and more people, organizations, and businesses rely on the Internet to share and search for information. These web-based resources are generally organized and presented in a hierarchical manner based on a certain categorization structure. We call the categorization structures used for organizing web-based resources their respective knowledge models. Such knowledge models are usually static: their creation and maintenance are normally done manually offline or in a human-mediated manner, which makes it difficult and time-consuming to adapt them to dynamically changing web-based resources. More importantly, while different users may have different perspectives and needs with respect to the existing knowledge model, they are restricted to the view provided by the current model. For example, on a university website, faculty members are usually listed by department. A web user may instead want to see how the faculty members can be grouped by their research interests. In this paper, we propose a user-centric approach that automatically identifies the web resources of interest and systematically generates and maintains knowledge models for the identified web resources while adapting to different user viewpoints. We apply such technologies as automatic web page classification, ontology, genetic algorithms, and neural network based Self-Organizing Maps (SOM) clustering. To illustrate the feasibility of our approach, a preliminary study on a set of faculty pages is presented. The contribution of this paper falls into the following areas: 1) proposing a systematic approach to cluster web-based resources using a SOM algorithm, which makes creation and maintenance of the knowledge models for those web resources easier; 2) creating adaptive knowledge models for the identified web resources through an interactive process that facilitates user-centric browsing and searching.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee. DESRIST 2006. February 24-25, 2006, Claremont, CA. CGU 2006
Keywords: dynamic and adaptive knowledge models, automatic knowledge model generation, web-based resources, knowledge management

1. Introduction

The number of web pages on the World Wide Web continues to grow exponentially as more and more people, organizations, and businesses rely on the Internet to share and search for information. Web-based resources are generally organized and presented in a hierarchical manner based on a certain categorization structure. For instance, a university homepage may have sub-pages for each college; each college may have sub-pages for each department; and the department pages may in turn have sub-pages for each faculty member. This would represent a university/college/department/faculty categorization structure. We call such categorization structures used for organizing web-based resources their respective knowledge models. Such knowledge models are usually static: their creation and maintenance are normally done manually offline or in a human-mediated manner, which makes it difficult and time-consuming to adapt them to the dynamic nature of web-based resources, where pages are added, revised, deleted, and moved with high frequency. Moreover, different users may have different perspectives and needs with respect to the existing knowledge model of a given set of web resources, yet they are restricted to the view provided by the model.

The motivation for dynamic and adaptive knowledge models can be further illustrated by the following use cases:

(1) The Southeastern Universities Research Association (www.sura.org) has a strategic initiative to develop grid infrastructure for its 62 member universities and is compiling a knowledgebase of faculty researchers and grid activity. An initial knowledgebase of faculty web pages from 9 SURA sites was developed manually by a dedicated researcher browsing each university site, a task spanning many months.
Meanwhile the underlying knowledge as well as the perspective on the knowledge model is constantly changing; faculty move, pages change, and
SURA now wants knowledge on all 62 sites along with perspectives that go beyond grid-related information.

(2) A faculty member who specializes in computer graphics is contemplating a change of position, possibly an appointment with the computer information systems department of a large state university. This researcher wants to review potential research collaborators in the computer science departments of several candidate universities to help him weigh his options. The researcher plans to find this information through the Internet and begins by visiting the website of the computer science department of one of these universities. The faculty pages are listed on the website alphabetically by name, but there is no categorization structure based on research interests, and the available onsite search function is limited to keyword searches. The researcher therefore uses a popular search engine (Google, www.google.com) to search the website (specifying the university’s domain as the search base) using the keywords “computer graphics faculty.” Of the 25 search results (search date: October 9, 2005) returned by Google, only two contain relevant information and only one links directly to a faculty web page (there are actually four faculty members in the department specializing in computer graphics). Ultimately the researcher gives up and simply reads through every faculty page at the site. This is a tedious and time-consuming process, especially if it must be repeated for several university sites.

In the above two use cases, dynamic and adaptive knowledge models of the underlying web resources would greatly facilitate their understanding and use. In this paper, we propose a user-centric approach that automatically identifies web resources of interest and systematically generates and maintains dynamic knowledge models for a given set of web resources while being adaptive to different user needs and perspectives.
Our approach will support automatic identification of web resources, based on user specified profile information (such as a profile based on a set of example web pages) and create knowledge
models for them using a facilitated user interaction process. Further, given an existing set of resources and their knowledge model, our approach would be able to closely replicate and automatically extend that knowledge model to incorporate additional web resources. Once the knowledge models are created, they could be automatically maintained to reflect the changes occurring in the underlying web resources; this is the dynamic feature of our approach. More importantly, the knowledge model created can adapt to different user needs, permitting multiple user viewpoints and individual perspectives of the same set of web resources, as suggested by the above use cases.

In the proposed approach, web page resources of interest are automatically identified (incorporating user profile specifications); then keywords are extracted; finally the web resources are clustered (with options for user personalization) and the clusters are labeled. Our approach uses such technologies as automatic web page classification and identification, ontology, genetic algorithms (GA) (Holland 1992), and neural network based Self-Organizing Maps (SOM) (Kohonen 1995).

The rest of the paper is organized as follows. Section 2 discusses related research in the areas of automatic web page classification, personalized search, user adaptive software systems, and clustering. Section 3 introduces the proposed approach for generating dynamic and adaptive knowledge models. A preliminary study on faculty web pages is presented in Section 4. Finally, we discuss the contributions of the paper in Section 5.

2. Related Research

2.1 Automatic Web Page Classification

The first step of our approach is to identify the web resources of interest to the user, e.g., finding all the faculty web pages in a university web site. This section discusses studies in the area of information retrieval and classification.
In the early 1960s, Fairthorne (Fairthorne 1961) and Hayes (Hayes 1963) demonstrated that the classification process can greatly improve the efficiency of information retrieval. Flake et al. (Flake et al. 2002) show that URLs and HREF links alone can be used to organize pages into related community groups. Kennedy and Shepherd (Kennedy and Shepherd 2005) use a neural net based classifier to distinguish home pages from non-home pages and to classify those home pages as personal home pages, corporate home pages, or organizational home pages. Craven et al. (Craven et al. 2000) conducted a study applying machine learning algorithms to automatically create a computer-understandable knowledge base whose content mirrors that of the World Wide Web and used that information for more effective web information retrieval. Some researchers are taking new approaches to dynamically monitoring web-based data and automatically establishing relationship models. Tomiyama et al. incorporate artificial intelligence techniques for automatically identifying similar concepts (Tomiyama et al. 2003). Salton conducted a number of experiments to show that a classic information retrieval (IR) model such as the vector model outperforms other IR models (for example, the probabilistic model) for general collections (Salton 1991).

In this paper, we integrated a web information classifier scheme with a successful commercial search engine, Google (www.google.com), to identify the web resources of interest to the user. We first use Google to do an initial search and then classify the search results. We adapted Salton’s vector model (Salton 1991) for classification and the method of Kennedy and Shepherd (Kennedy and Shepherd 2005) for selecting initial candidate web pages. Salton’s vector model not only generates a good ranking strategy, but is also simple and not very computationally intensive.

2.2 Clustering Techniques
Clustering is a well-known technique for classifying patterns (observations, data items, or feature vectors) into groups (Jain et al. 1999). Clustering is well suited to our problem domain since it pertains to generating classification structures for web resources. We adapt a neural network based unsupervised clustering technique, Self-Organizing Maps (SOM) (Kohonen 1995). SOM produces a similarity graph of input data by converting the nonlinear statistical relationships between high-dimensional data into simple geometric relationships of their image points on a low-dimensional display, usually a regular two-dimensional grid of nodes. It thereby compresses information while preserving the most important topological and geometric relationships of the primary data elements on the display. Zhao and Ram (Zhao and Ram 2004) compared SOM, K-means, and hierarchical clustering for clustering relational database attributes. They concluded that the three methods have similar clustering performance, though SOM is better than the other two methods at visualizing clustering results. Mangiameli et al. (Mangiameli et al. 1996) compared SOM and seven hierarchical clustering methods experimentally and found that SOM is superior to all of the hierarchical clustering methods. Rauber and Merkl successfully applied the SOM technique to automatically structure a document collection and to create a digital library system (Rauber and Merkl 1999). While SOM has proven to be very effective in clustering both structured information (e.g., relational database schemas) and unstructured information (e.g., documents), its clustering performance is greatly affected by its parameter values and feature set selection (Kohonen 1995; Liang et al. 2006, to appear). In this paper, we use SOM to further organize the web resources identified by the initial classification procedure.
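The SOM mechanism described above can be illustrated with a minimal sketch. It is not the SOM engine used in this paper: the grid size, the linearly decaying learning rate and neighborhood radius, the Gaussian neighborhood function, and all names (`train_som`, `best_matching_unit`, the example pages) are assumptions chosen only for illustration. These are also the kind of parameter values to which the clustering result is sensitive.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid node whose weight vector is closest to x (squared Euclidean distance)."""
    return min(weights, key=lambda node: sum((w - v) ** 2 for w, v in zip(weights[node], x)))


def train_som(vectors, rows=3, cols=3, epochs=50, lr0=0.5, radius0=1.5, seed=0):
    """Train a small rectangular SOM; returns a dict mapping (row, col) to a weight vector.

    All default parameter values are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    total_steps = epochs * len(vectors)
    step = 0
    for _ in range(epochs):
        order = list(vectors)
        rng.shuffle(order)
        for x in order:
            decay = 1.0 - step / total_steps          # shrinks linearly from 1 towards 0
            lr = lr0 * decay                          # learning rate
            radius = max(radius0 * decay, 0.5)        # neighborhood radius, floored at 0.5
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                grid_dist2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                influence = math.exp(-grid_dist2 / (2.0 * radius ** 2))
                for i in range(dim):
                    # Pull the node towards x, more strongly the closer it is to the BMU.
                    w[i] += lr * influence * (x[i] - w[i])
            step += 1
    return weights


# Hypothetical keyword-presence vectors for four pages (two apparent topics).
pages = {
    "page_a": [1, 1, 0, 0],
    "page_b": [1, 1, 0, 1],
    "page_c": [0, 0, 1, 1],
    "page_d": [0, 1, 1, 1],
}
som = train_som(list(pages.values()))
placement = {name: best_matching_unit(som, vec) for name, vec in pages.items()}
```

Each page ends up at the grid node returned by `best_matching_unit`; pages with similar vectors tend to land on the same or neighboring nodes, which is what the two-dimensional map visualizes.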
2.3 Personalized Search and Presentation

To our knowledge, few studies have been done on the adaptivity of knowledge models for web resources. However, considerable research has been conducted in a closely related area: personalized search. Commercial search engines such as Google (www.google.com) and Yahoo (www.yahoo.com) have provided personalized search services based on a user’s preferences. Teevan et al. proposed search algorithms that consider a user’s prior interactions with a wide variety of content to personalize the user’s current web search (Teevan et al. 2005). Most such personalized search engines focus on finding user-related information but neglect the mechanisms for creating a knowledge model that suits the user’s needs. We are interested in such knowledge models, their generation, and how the information is organized.

Many studies organize search results by clustering the results themselves. Gauch et al. (Gauch et al. 2004) reported research that adapts information navigation based on a user profile structured as a weighted concept hierarchy and provides users with personalized search and browsing. Ferragina and Gulli (Ferragina and Gulli 2005) introduced a personalized search engine, called SnakeT, that is able to process search results generated by commodity search engines and to create, on the fly, a hierarchy of labeled folders using a hierarchical clustering technique. The SnakeT engine demonstrated performance similar to the commercial search engine Vivísimo (http://vivisimo.com/html/vce). Zeng et al. (Zeng et al. 2004) also conducted a study on organizing web search results into clusters. They focused on how to effectively name the clusters based on a regression model learned from human-labeled training data.
One potential drawback of the above clustering research approaches is that they don’t allow much interaction between the system and the user (what the user sees is the same search results but presented in a different manner). While this could be a feature in some cases because it avoids inconveniencing
the user, we argue that user interaction is important, if not critical, to creating adaptive knowledge models. The user needs to interact with and give some direction to the clustering engine. This way the clustering process can be performed a few times until it reaches a reasonable level of performance for the user’s needs.

2.4. User Adaptive Software Systems

Since this paper deals with the adaptiveness of knowledge models, another related research area is user adaptive software systems (Schneider-Hufschmidt et al. 1993). One research stream in user adaptive software systems is adaptive hypermedia. Adaptive hypermedia adapts to the needs of the user by building models of knowledge and preferences for each individual user (Brusilovsky 2001). Adaptive hypermedia systems include educational hypermedia, on-line information systems, information retrieval hypermedia, systems for managing personalized views in information spaces, etc. (Brusilovsky 2001). The systems for managing personalized views are particularly related to this research. For example, My Yahoo (http://my.yahoo.com) lets users customize their Yahoo homepage. However, the majority of personalized sites are adaptable but not adaptive (Brusilovsky 2001). Moreover, such adaptation is usually quite simple: users can only add or remove content from the web pages. In this paper, we seek to bring user-centric adaptation to the selected web resources so that users can reorganize those resources based on their own perspectives.

3. Research Approach

We propose to research and develop dynamic and adaptive knowledge models (Figure 1) for web-based resources. Our research is built on the following proposition: by applying such techniques as automatic web page classification, keyword extraction, neural network based clustering, genetic algorithms, and using domain ontology resources where available, a computer
system can automatically extract, analyze, and present information from web sources to produce a knowledge model without a priori information or can closely replicate an existing knowledge model. Such knowledge models would automatically reflect the changes in the underlying web resources and adapt to different user viewpoints or perspectives.

[Figure 1. Adaptive Knowledge Model for Web-based Resources. Information layers: User Views; Dynamic & Adaptive Knowledge Models (model components: Keywords, Ontology, Clustering); Web Resources.]

We make two assumptions regarding the web resources: 1) the web resources investigated contain information that is sufficient for a human being to create a knowledge model; 2) the domain of web resources has sufficient information to train a neural network based SOM algorithm. The research approach has two phases: 1) model preprocessing and 2) model building. The research framework is shown in Figure 2.

[Figure 2. Proposed Research Approach. Elements: User Profile; Model Preprocessing (Web Source Identification, Keyword Extraction); Ontology; Model Building (Clustering, Labeling); Evaluation.]

In the model preprocessing phase, the first step is to let the user provide a profile reflecting her perspective or needs. A useful user profile can be composed of a mixture of 1) user specification of a web site exemplifying the user’s
perspective, 2) pre-existing profiles, and 3) user selection from existing keywords or category/subject lists. The user profile is used to guide what information to retrieve, what keywords to select for clustering, or what kind of ontology to use for labeling the clusters.

The second step of model preprocessing is to identify the set of web resources. These web resources could be given by the user (the user provides a URL of a website, e.g., the department site of a university) or can be retrieved from the World Wide Web using a web crawler or a search engine based on the user profile. For example, a user may be interested in faculty members of a university who work in the grid computing area. In this situation, a web crawler and/or commercial search engines can identify a set of potentially relevant web pages.

The third step in model preprocessing is keyword extraction. Considerable research has been done on information extraction from web pages (Flake et al. 2002; Chang et al. 2001; Crescenzi and Mecca 2004). In our approach, all the text content from the identified web pages is extracted as potentially useful for knowledge modeling. The content of each target web page is extracted using the “View Source” browser feature. Selecting all space-delimited tokens within the extracted content generates a set of keywords. Research questions in this procedure include: 1) How deeply should the crawler search a web page in following hyperlinks or opening additional pages? Theoretically, the web crawler can go as deep as it chooses. In our preliminary study, we extracted the information only from the first page the Google search engine returned. We can optionally extend this approach when treating more complicated domains. 2) How should the keywords in the generated set be filtered? The keywords can be filtered, for example, by eliminating “stop” words, consulting the user profile, employing domain ontology, or setting frequency counts.
For example, if a user wants to cluster faculty web pages
based on her own research interests, all the keywords unrelated to the user’s research interests could be removed.

In the fourth step, the extracted keywords are used as a feature set in the model-building phase. Through clustering, the set of input web pages is grouped into related groups based on the feature set. The clustering result of SOM is sensitive to the selection of its parameter values (Kohonen 1995). While SOM is used widely in document classification, few studies have been conducted on selecting optimal SOM parameters or their values. Liang et al. (Liang et al. 2006; to appear) researched SOM parameters for their directory project by trying different permutations of such values. In the current research, we use a genetic algorithm (GA) (Holland 1992) to discover near-optimal SOM parameter values. Research issues include: 1) Which keywords are sufficient to classify pages: all keywords, those occurring within certain frequency ranges, or those that occur on some percentage of pages? There is no theoretical solution to this question. In our approach, the frequency count threshold value is determined empirically. 2) It is common for the hierarchy of web-based information to have multiple levels. How can we model such multiple levels of information? While we currently focus on providing a one-level knowledge model, we are investigating hierarchical SOM (Rauber et al. 2002) as a potential solution for this issue. 3) How can the resulting clusters be validated, given that the clustering is unsupervised? If there already exists a knowledge model for the web resources reflecting the user profile (e.g., the faculty in the university grouped by departments), that model can be used as a reference model for guiding the GA to optimize SOM parameter values. In such a case, the whole model building process can be fully automated. However, if there is no pre-existing knowledge model reflecting the user
profile, the SOM clustering result will optionally need to be reviewed by the user. The user may make some adjustments and SOM will re-cluster the information based on the user’s recommendation. This process might be repeated until a satisfactory result is obtained.

The fifth step in the process is labeling clusters with meaningful names that can be important for the knowledge model generated. If there already exists a knowledge model reflecting the user profile, the initial generated model can use such pre-existing labels. If there is no such pre-existing model, then domain ontology may play an important role in labeling. Research issues include creation and maintenance of the ontology and the user’s perception of labels as being useful to the user profile.

Once the initial knowledge model is created, it can be configured to dynamically adapt to changes in the underlying knowledge. For example, if a new faculty member joins the university, our system would include her web page in the clustering of researcher pages, resulting in an updated knowledge model. To have the model adapt to a new perspective, the user would modify her profile. In this case, the preprocessing and building process may be repeated, and the user could iteratively customize the knowledge model further by specifying additional information, such as adding missing terms or identifying irrelevant terms that should be removed. Missing or irrelevant items are used as penalties during re-training to improve the accuracy and precision of the resulting model.

The sixth and final step of the research approach is the evaluation of the system. We propose to conduct and extend experimental validation following the method that has been used in preliminary work, as described in the following section.

4. Preliminary Studies and Results
To test the feasibility of our approach, we conducted preliminary studies on a relatively simple domain: a set of faculty web pages of a large research university. The experimental raw data was drawn directly from the Google search engine. In the studies, we demonstrated that a set of web pages can be automatically identified; an existing knowledge model can be automatically replicated; and the created model can adapt to changes in the underlying web resources and to different user perspectives.

4.1 Evaluation Metrics

In this paper, we used the standard metrics of the information retrieval domain, recall and precision, to measure the performance of our faculty web page identification approach:

$$\text{Recall} = \frac{\text{number of retrieved and relevant documents}}{\text{number of relevant documents}}$$

$$\text{Precision} = \frac{\text{number of retrieved and relevant documents}}{\text{number of retrieved documents}}$$

To evaluate the performance of SOM clustering, we adapted two metrics, cluster recall and cluster precision, from (Mangiameli et al. 1996). They are defined as follows:
$$\text{Cluster recall (CR)} = \frac{\text{total number of correct associations in computer clustering}}{\text{total number of associations in manual human clustering}}$$

$$\text{Cluster precision (CP)} = \frac{\text{total number of correct associations in computer clustering}}{\text{total number of associations in human clustering}}$$
Note: 1) An association refers to a pair of objects (the basic unit of clustering; in our case, a single faculty web page); 2) Computer clustering refers to the clustering result generated by our approach; 3) Human clustering refers to the clustering result created by a domain expert. The overall clustering performance is measured by F-measure value (Larsen and Aone 1999) (Stein and Eissen 2002) (Van Rijsbergen 1979). F-measure is a mechanism to provide for an
overall estimate of the combined effect of recall and precision. F-measure is a standard evaluation metric in the field of information retrieval. The F-measure formula is expressed as:

$$F\text{-Measure} = \frac{(\beta^2 + 1) \cdot R \cdot P}{\beta^2 \cdot P + R}$$

Here R is recall, P is precision, and BETA (β) is the relative importance of recall vs. precision, such that a BETA value of 0 means that F-Measure = precision and a BETA value of ∞ means that F-Measure = recall. (BETA=1.0 means that recall and precision are equally weighted; BETA=0.5 means that recall is relatively less important than precision; BETA=2.0 means that recall is relatively more important than precision.) In our case, we choose BETA to be 1.0, assigning equal importance to recall and precision. The higher the F-Measure value, the better the clustering result.

4.2. Preliminary Study I

The first study focused on automatically identifying the web resources of interest to the user. Specifically, we wanted to obtain a set of faculty web pages from a university website. Those web pages could be further analyzed in Preliminary Study II and Preliminary Study III.

A training set of 40 faculty web pages was used, which generated over 20,000 keywords. A program processed each URL, using the View Source feature to download the page data, and selected all space-delimited text as keywords. A “stop” word list (taken from a link found using Google Search = “common words in English,” adding punctuation and other non-alpha terms, and removing words fewer than 4 characters in length) reduced the size of the original keyword list to 4,977. These keywords were ordered by the number of pages that contained at least one occurrence of the keyword, creating a list of words (Table 1) that occurred on the web pages (from being present on all pages to occurring on at least 1 page). These keywords were retained ‘as is’
without performing stemming on them. However, the proposed approach can be extended to support stemming.

To test the classification, a set of candidate web pages was obtained by collecting the search results returned from Google instead of spidering through the web pages from an initial set of URLs. This set of candidate pages was created using the query pattern suggested by Kennedy and Shepherd’s classification of home pages (Kennedy and Shepherd 2005): a Google search found pages containing all three words faculty and professor and email, and either of the words he or she, at the domain of a university (cf. .edu). This Google search returned a result set of 480 pages, presumably faculty home pages that would provide information on research interests. Of this set, 200 PDF pages were removed, and 14 pages had “404 page not found” errors. The remaining 266 included 145 actual “faculty web pages” and 121 other pages such as news pages, lists of links, and departmental or non-faculty pages (this was based on human inspection).

The classification program then processed each of the 266 URLs, using View Source to download the page data. It tested several patterns of keywords from the training pages (those occurring on more than a defined threshold number of pages) to classify the 266 target pages (see Table 1). Classification results (Table 2) show that using just the top 0.3% of keywords (those occurring on the most pages) resulted in 85% of faculty pages being correctly classified (recall), with an overall F-Measure = 0.6703. The classifier precision of 0.5511 (see Table 2) obtained by using just the top 0.3% of keywords turned out to be better than the precision of the set of candidate pages returned by Google advanced search, calculated as 0.5451 (that is, 145/266).
While additional research may improve recall and precision, this initial study indicates the feasibility of automatically classifying target pages. In the future, we will investigate a genetic algorithm as a means of improving these results.

Table 1. Keywords Used to Classify Retrieved Pages as Faculty Pages

Keyword (document frequency:       Top 0.6%      Top 0.4%      Top 0.3%
# of pages on which keyword        keywords      keywords      keywords
appeared)                          (29 total)    (19 total)    (15 total)
University (37)                    √             √             √
Univ (37)                          √             √             √
State (36)                         √             √             √
Prof (36)                          √             √             √
Georgia (34)                       √             √             √
Research (33)                      √             √             √
Professor (33)                     √             √             √
Search (33)                        √             √             √
Cult (31)                          √             √             √
Public (30)                        √             √             √
Faculty (30)                       √             √             √
Mail (28)                          √             √             √
Publication (27)                   √             √             √
Interest (27)                      √             √             √
Department (27)                    √             √             √
Publications (25)                  √             √
Phone (23)                         √             √
Form (23)                          √             √
Student (23)                       √             √
Amer (21)                          √
Pers (21)                          √
Home (21)                          √
Interests (20)                     √
Program (20)                       √
Staff (20)                         √
Port (20)                          √
Journal (20)                       √
Gene (19)                          √
America (19)                       √
Table 2. Metrics on Classifying Retrieved Pages as Faculty Pages

Number (percent of 4977) of      Retrieved    Retrieved and Relevant    Recall    Precision
classification keywords used     Pages        Pages (Faculty)
29 (0.6%)                        107          49                        0.3379    0.4579
19 (0.4%)                        183          104                       0.7172    0.5683
15 (0.3%)                        225          124                       0.8552    0.5511
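As a worked check, the recall, precision, and F-Measure values of Study I can be re-derived from the counts in Table 2 and the 145 faculty pages identified by human inspection, using the formulas of Section 4.1. The sketch below is only an illustration; the function and variable names are ours, not part of the classification program.

```python
def recall(retrieved_relevant, relevant):
    """Recall = retrieved and relevant documents / relevant documents."""
    return retrieved_relevant / relevant


def precision(retrieved_relevant, retrieved):
    """Precision = retrieved and relevant documents / retrieved documents."""
    return retrieved_relevant / retrieved


def f_measure(r, p, beta=1.0):
    """F-Measure = (beta^2 + 1) * R * P / (beta^2 * P + R)."""
    b2 = beta * beta
    return (b2 + 1.0) * r * p / (b2 * p + r)


RELEVANT_PAGES = 145  # faculty pages among the 266 candidate pages

# (keywords used, retrieved pages, retrieved and relevant pages), from Table 2
table2_counts = [(29, 107, 49), (19, 183, 104), (15, 225, 124)]

metrics = {}
for n_keywords, retrieved, hits in table2_counts:
    r = recall(hits, RELEVANT_PAGES)
    p = precision(hits, retrieved)
    metrics[n_keywords] = (r, p, f_measure(r, p))  # BETA = 1.0 as in the paper

for n_keywords, (r, p, f) in metrics.items():
    print(n_keywords, round(r, 4), round(p, 4), round(f, 4))
```

With BETA = 1.0, the 15-keyword row yields the F-Measure of 0.6703 quoted in the text; setting `beta=0` makes `f_measure` return the precision, matching the limiting case described in Section 4.1.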
4.3. Preliminary Study II

In a second study, we showed that our approach could automatically replicate an existing knowledge model. Thirty-three faculty pages were randomly selected from 4 departments within the same university. In the keyword extraction phase, “stop” words were removed (using the same mechanism as in Preliminary Study I). A feature set of 237 keywords was selected based on keywords occurring on 20%-40% of the faculty pages; in this study no domain ontology was used to further refine the keywords. The web pages, represented by vectors based on the feature set of 237 keywords, were processed by the SOM engine. A set of SOM parameter values was “discovered” using a genetic algorithm that sought clustering based on departments. The clustering result is shown in Figure 3, wherein faculty members in the same department are correctly clustered together.

To further test the approach, we randomly selected a biology faculty web page that was not present in the original list. After extracting keywords and generating its feature vector based on the master feature vector, it was placed into the generated model (placed in the SOM map using the same parameter values discovered during the original training of the map). The new biology faculty web page was grouped with the other biologists as shown in Figure 4.
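The feature-set construction described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the program used in the study: the stop-word list is a tiny placeholder, tokens are lower-cased and stripped of non-alphabetic characters, and pages are encoded as binary keyword-presence vectors (the paper does not specify the vector weighting). All function names and example texts are hypothetical.

```python
import re

# Placeholder stop-word list; the study used a much longer list of common English words.
STOP_WORDS = {"this", "that", "with", "from"}


def extract_keywords(text, stop_words=STOP_WORDS, min_length=4):
    """Split on whitespace, normalize tokens, and drop stop words and short words."""
    tokens = (re.sub(r"[^a-z]", "", token.lower()) for token in text.split())
    return {t for t in tokens if len(t) >= min_length and t not in stop_words}


def select_features(page_keywords, low=0.20, high=0.40):
    """Keep keywords whose document frequency lies between low and high (as page fractions)."""
    n_pages = len(page_keywords)
    doc_freq = {}
    for keywords in page_keywords:
        for keyword in keywords:
            doc_freq[keyword] = doc_freq.get(keyword, 0) + 1
    return sorted(k for k, df in doc_freq.items() if low <= df / n_pages <= high)


def to_vector(keywords, features):
    """Binary presence vector over the master feature list (an assumed encoding)."""
    return [1 if feature in keywords else 0 for feature in features]


# Hypothetical page texts standing in for downloaded faculty pages.
texts = [
    "Research on grid computing",
    "Research on grid middleware",
    "Research in computer graphics",
    "Teaching databases",
    "Teaching networks",
]
page_keywords = [extract_keywords(t) for t in texts]
features = select_features(page_keywords)
vectors = [to_vector(k, features) for k in page_keywords]

# A page added later is encoded against the same master feature list.
new_vector = to_vector(extract_keywords("Grid computing and networks"), features)
```

In this toy data, "research" appears on 60% of the pages and is therefore excluded, while keywords appearing on one or two of the five pages are retained. Encoding a new page against the existing feature list is what allows it to be placed into an already trained map.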
Figure 3. Clustering result for 33 randomly selected faculty pages
Figure 4. Clustering result for 34 randomly selected faculty pages
The second study shows that: 1) SOM can effectively classify selected web pages; 2) a genetic algorithm can be beneficially used to optimize SOM parameter values. In summary, our initial work showed that the approach can effectively categorize and identify the web sources of interest and generate a meaningful knowledge model. To achieve improved adaptability, our current goal is to refine the knowledge model by applying domain ontology and keyword selection.

4.4. Preliminary Study III

In this study, we sought to demonstrate the adaptiveness of our approach. We address the faculty browsing and searching problem that was described in the introduction: a researcher wants to find research partners in a Computer Science department, but there is no pre-classification of faculty pages based on research interests or the search tool is limited.

In the first case, the researcher preferred to see a knowledge model for the given web resources (faculty pages) based on research interests (not, say, by college and department). The researcher supplied the required profile perspective, in this case by specifying particular research interests. All faculty pages from the researcher’s department were retrieved, keywords (based on the indicated research interests) were extracted, and the pages were clustered using a typical (default) set of SOM parameter values. The result was presented to the user as a two-dimensional map (see Figure 5). Figure 5 shows that this is not a good clustering (it lacks well-defined groupings). This further confirmed that the SOM clustering result is sensitive to the parameter values chosen for the given dataset.
Figure 5. Initial clustering results for a set of faculty pages
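Because default parameter values can give a poor map, the second study searched for better values with a genetic algorithm. The following is a generic sketch of such a search, not the authors' implementation: the parameter names, bounds, population settings, and the toy fitness function are hypothetical. In the setting described here, the fitness function would instead train a SOM with the candidate parameter values and score how well the resulting clusters match the desired grouping.

```python
import random

def genetic_search(fitness, bounds, pop_size=20, generations=30,
                   mutation_rate=0.2, seed=0):
    """Search for parameter values that maximize fitness(params)."""
    rng = random.Random(seed)

    def random_individual():
        return {k: rng.uniform(lo, hi) for k, (lo, hi) in bounds.items()}

    def crossover(a, b):
        # Each parameter is inherited from one of the two parents.
        return {k: rng.choice((a[k], b[k])) for k in bounds}

    def mutate(individual):
        out = dict(individual)
        for k, (lo, hi) in bounds.items():
            if rng.random() < mutation_rate:
                step = rng.gauss(0, 0.1 * (hi - lo))
                out[k] = min(hi, max(lo, out[k] + step))  # stay within bounds
        return out

    population = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]          # the fitter half survives
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents)))
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return max(population, key=fitness)

# Hypothetical search space for two SOM parameters.
bounds = {"learning_rate": (0.01, 1.0), "radius": (0.5, 5.0)}

def toy_fitness(params):
    """Stand-in for 'train a SOM and score the clustering' (an assumption)."""
    return -(params["learning_rate"] - 0.3) ** 2 - (params["radius"] - 2.0) ** 2

best = genetic_search(toy_fitness, bounds)
```

Keeping the fitter half of each generation unchanged means the best candidate found so far is never lost, at the cost of somewhat less exploration.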
The researcher interacted with the system and made some adjustments to the clusters. For example, the researcher put faculty members Belkasim and Weeks into the reference set based on her contention that those two members have similar research interests. Based on the user’s recommendation, the faculty pages were re-clustered. The researcher may need to repeat the process several times until a satisfactory result is reached. Figure 6 shows the clustering result after 2 iterations. The computer-generated clustering result was compared to the clustering result produced by a domain expert (a Ph.D. student majoring in Computer Information Systems manually clustered the faculty pages based on faculty research interests; details of the result are presented in Appendix A). The CR is 0.4667, the CP is 0.4667, and the F value is 0.4667. This is a reasonably good clustering. In this preliminary study, we simplified the feature set selection process: if the user’s perspective is that of research interests, we simply extracted everything in the research interests section. Future studies on SOM feature set selection can further improve the SOM clustering performance.
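As a consistency check on the reported figures, and assuming the F value is the usual harmonic mean of CP and CR (an assumption, since the definition is not repeated in this section), equal CP and CR must yield the same F value:

```latex
F = \frac{2 \cdot CP \cdot CR}{CP + CR}
  = \frac{2 \times 0.4667 \times 0.4667}{0.4667 + 0.4667}
  = 0.4667
```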
Figure 6. Clustering result for a set of faculty web pages after 2 iterations
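This section does not spell out how the reference set steers the re-clustering. One plausible reading, in line with the genetic-algorithm search of the second study, is that candidate clusterings are scored by how well they keep reference-set members together. The sketch below shows such a score; the function name, the scoring rule, and the map coordinates are hypothetical, and only the two faculty names placed in the reference set come from the text.

```python
from itertools import combinations

def reference_set_score(assignments, reference_set):
    """Fraction of reference-set pairs that share the same map unit."""
    pairs = list(combinations(sorted(reference_set), 2))
    if not pairs:
        return 0.0
    together = sum(assignments[a] == assignments[b] for a, b in pairs)
    return together / len(pairs)

# Invented (row, column) map positions produced by two candidate parameter sets.
candidate_a = {"Belkasim": (0, 0), "Weeks": (2, 3), "Zhu": (0, 0)}
candidate_b = {"Belkasim": (1, 1), "Weeks": (1, 1), "Zhu": (3, 0)}

reference = {"Belkasim", "Weeks"}
scores = {
    "a": reference_set_score(candidate_a, reference),
    "b": reference_set_score(candidate_b, reference),
}
```

Under this reading, candidate b would be preferred because it places both reference-set members on the same unit, and each round of user feedback simply enlarges or changes the reference set before the search is repeated.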
In another case, the researcher just wanted to find faculty members who had research interests similar to her own with whom she could potentially collaborate. The researcher pointed to her home page as the user profile for “research interests.” Our system then clustered all the faculty pages along with the researcher’s page using typical SOM parameter values (the researcher is annotated as “researcher”). The initial clustering result is presented in Figure 7.
Figure 7. Initial clustering result for a set of faculty pages along with researcher page
The researcher pointed out a particular faculty member (Prof. Belkasim) who had research interests similar to her own. Our system then used this as the reference set and re-clustered the whole dataset. The process was repeated two times. The final clustering result is shown in Figure 8. In the figure, 4 faculty members are identified as potential research collaborators. The precision is 44.44% and the recall is 66.67%. These values are much higher than those obtained through a Google search (we applied the advanced keyword search described in the introduction; the precision and recall were both less than 8%). Even better precision and recall are expected with further investigation of SOM feature set selection. This preliminary study showed that our approach can create an adaptive knowledge model on a set of web resources and facilitate the browsing and searching of such resources.
Figure 8. Clustering results for a set of faculty pages along with researcher page after 2 iterations.
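Precision and recall for a collaborator search of this kind can be computed as in the minimal sketch below. It assumes the standard set-based definitions; the page identifiers and counts are hypothetical and are not the data behind the figures reported above.

```python
def precision_recall(retrieved, relevant):
    """Standard set-based precision and recall."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: pages grouped with the researcher versus pages a
# domain expert considers true potential collaborators.
grouped_with_researcher = {"p1", "p2", "p3", "p4", "p5"}
expert_collaborators = {"p2", "p4", "p6", "p7"}

precision, recall = precision_recall(grouped_with_researcher, expert_collaborators)
```

Here precision penalizes pages grouped with the researcher that the expert would not suggest, while recall penalizes suitable collaborators that the map placed elsewhere.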
5. Discussion

The feasibility of creating dynamic and adaptive knowledge models for web-based resources has been demonstrated in the preliminary studies. Knowledge models that can be automatically maintained, that reflect dynamically changing web resources, and that provide an adaptive interface tailored to individual viewpoints are potentially powerful tools. The contribution of this paper falls into the following areas: 1) proposing a systematic approach for clustering web-based resources using a SOM algorithm that makes creation and maintenance of the knowledge models for those web resources easier; 2) creating adaptive knowledge models for the identified web resources by an interactive process to facilitate user centric browsing and searching.

Research remains to be conducted in defining and maintaining user profiles, selecting initial web pages pertinent to the user profile, providing interactive mechanisms to refine those web pages and user profiles, setting keyword thresholds, determining frequency count threshold values of keywords sufficient to categorize pages, and labeling the resulting clusters. Continuing to build a component-based prototype tool, coupled with experimental validation, provides a good way to approach workable knowledge model solutions, for example, finding all faculty research web pages across a set of universities and understanding how they might reveal collaborative research opportunities.

In our preliminary studies, we chose faculty web pages as the domain because it is relatively focused. In the future we will test our approach on web-based resources such as the Society for Neuroscience database gateway, which represents a domain-specific knowledge model for neuroscience, a rapidly expanding, dynamic, and complex area of scientific knowledge and collaboration.

6. Acknowledgement

This work is partially supported by NSF ITR Grant IIS-0312636, NSF NMI Grant No. ANI0123937, Sun Microsystems Academic Equipment Grant EDUD 7824-010460-US, Georgia State University Brain and Behavior Fellowship program, Georgia State University’s Robinson College of Business, and Georgia State University’s Information Systems & Technology.
Appendix A

Domain Expert Clustering Result for a Set of Faculty Web Pages

Name            Cluster Index No.
Belkasim        1
Scott           1
Weeks           1
Zelikovsky      1
Zhu             1
Hu              2
Preethy         2
Sunderraman     3
Bourgeois       4
Pan             4
Prasad          4
Beyah           5
Li              5
Harrison        6
Zhang           7
King            8