Uncovering the Latent Underlying Domains of a Research Field: Knowledge Visualization Revealed
Tsung Teng Chen, Liang Chi Hsieh
Graduate Institute of Information Management, National Taipei University

Abstract
This paper illustrates how to clarify a new research topic, knowledge visualization, by analyzing a large body of citations related to the field. The study presents a method that may help scholars who are interested in an interdisciplinary and not yet well understood research area, such as knowledge visualization, by revealing the underlying disciplines on which the research area is based. Key papers relevant to the research area are also identified, which may help researchers gain an overview of the field. The method discussed in this paper is generic in nature and generally applicable to other research topics.
1. Introduction
Knowledge visualization has been defined [1] as a new scientific discipline that examines the use of visual representations to improve the transfer of knowledge between at least two persons or groups of persons. SemNet [2] is a knowledge visualization system that represents elements of a Prolog rule knowledge base as labeled rectangles connected by lines or colored arcs, producing 3D graphic representations of large knowledge bases that help users comprehend complex relationships among rules. The study of knowledge domain visualization aims to improve the understanding of the development of a knowledge domain through a study of its quantitative and qualitative properties [3]. Knowledge domains are represented collectively by the articles of a designated research area. Knowledge domain visualization uses citation and co-citation analysis techniques to recognize patterns that may represent research specialties or intellectual groups [3]. There seems to be no consensus among researchers on how to define the context and content of knowledge visualization research. We describe an approach to uncover the context and content of a relatively new research
Proceedings of the Information Visualization (IV’06) 0-7695-2602-0/06 $20.00 © 2006
IEEE
discipline, knowledge visualization in this case, through analyzing the literature and citations in this domain.
2. Related works
An extensive domain analysis of the information science discipline has been presented in terms of its authors [4]. In this study, the names of the most frequently cited authors in 12 key journals from 1972 through 1995 were retrieved from Social SciSearch via DIALOG. The top 120 authors were analyzed via author co-citation analysis, which yielded automatic classifications relevant to the history of the field. The results revealed two interesting aspects of the field: (1) the specialty structure of the discipline over 24 years; and (2) evidence of a paradigm shift in information science during this period. However, the study selected only a list of 12 key journals. It might have presented a different picture had it included most, if not all, of the representative publications in the field.

In order to better understand the topic of knowledge domains, Garfield [5] created a series of databases related to knowledge domains from searches of the ISI citation indexes (SCI, SSCI, and AHCI). The collected data are used to generate chronological maps of subject collections with a software package called HistCite, which also highlights the most-cited works inside and outside the collection. Garfield conducted a general search for papers on “Knowledge and Domain” via Web of Science and found 280 papers with the keywords “Knowledge” and “Domain” in the title. The citation metrics of these 280 papers were tabulated based on their global and local citation counts. The local count is the number of citations a paper received from the 280 papers; the global count is the overall number of times an article is cited by other papers. Garfield illustrates the problem of conducting a search based solely on the terms “knowledge” and “domain”: only 8 papers were cited locally three or more times in the collection of 280 papers. The paper most cited within the local collection was published in the Review of Educational Research in 1988 under the title “The Interaction of Domain-Specific and Strategic Knowledge in Academic Performance”. Garfield therefore concluded that the terms used in this search were inadequate to the task and expanded the terms employed in subsequent searches. In order to capture literature relevant to the “mapping of knowledge domains,” five different data sets covering the topics of Information Visualization, Dynamic Systems, Co-Citation, Bibliographic Coupling, and Scientometrics (the journal) were collected, and the resulting data were merged to create a multi-domain collection. The merged collection was pruned from approximately 6,000 papers to about 3,600 based on citation and other criteria. The two most cited works among these 3,600 were two books on visual display and envisioning information authored by Edward Tufte. The collection resulting from this expanded search includes articles authored by Small (co-citation analysis) and Kessler (bibliographic coupling). The key issue with Garfield’s approach is finding a set of subject terms that fully covers the initial key terms, “Knowledge Domain” in this case.

Chen and Paul [6] proposed a four-step procedure to visualize a knowledge domain’s intellectual structure. First, authors whose work has received citations above a predetermined threshold are selected. The co-citation frequencies for these authors are calculated from a citation database and stored in a frequency matrix, from which a matrix of Pearson correlation coefficients is computed. Second, Pathfinder network scaling [7] is applied to the network that the correlation matrix defines. Third, factor analysis is used to identify intellectual groupings, which are then overlaid on the interconnectivity structure obtained in step two.
Authors belonging to the same specialty should appear as a closely connected group in the network. Finally, the citation impact of each author is displayed atop the intellectual groupings; the height of a citation bar represents the magnitude of the impact. In Chen and Paul’s study, a collection of 10,292 articles published in IEEE Computer Graphics and Applications over a period of 18 years was used to illustrate this four-step process. Three hundred and fifty-three authors received more than five citations in this collection, and these were used to build the author co-citation network, which was stored as a 353 by 353 matrix. The 28,638 arcs in the original citation network were scaled down to 355 arcs by applying Pathfinder network scaling. This type of procedure is universally applicable in identifying specialties from co-citation data. However, the specialties identified with this procedure need to be decoded and elaborated by human readers. The citation data are pre-selected from established sources, which may not include all important data sources. It could be argued that the five-citation threshold used to prune out insignificant authors is selected rather arbitrarily, which may induce bias in the derived intellectual structure. The computational complexity of the Pathfinder network is O(N^2) [7], which in practice limits the scale of the co-citation matrix to a few hundred nodes.
3. Uncovering the latent underlying domains of a research field
3.1. The scheme
We applied the pruning scheme developed by Chen and Xie [8] to a complete data set collected from the CiteSeer citation index system [9]. The pruning method, which consists of two steps, is used to trim the complete citation graph. One threshold value is selected and applied in each step. The first step filters out all nodes with an in-degree value less than the first threshold value; its purpose is to build the highly cited initial core set of nodes. The second step checks the in-degree value of the papers that cite, or are cited by, the papers in the initial core set. Papers with an in-degree value higher than the second threshold are added to the core set. The procedure terminates when all the nodes in the core set have been processed and no new nodes can be added.

The complete citation graph was built from the literature retrieved by querying the term “Knowledge Visualization” from CiteSeer on October 20, 2005. The complete graph contains 191,357 document nodes and 394,621 citation arcs. Two iterations of pruning with two different sets of threshold values are carried out in this study to reveal the research context of knowledge visualization at finer granularity. Applying ascending thresholds prunes out less cited articles, which in turn breaks up a crammed graph into a number of disjoint weak components. A weak component in the graph is considered a distinct research topic that provides supporting knowledge or relevant information to the designated research domain. This procedure helps us visually uncover the underlying contexts of the domain. Two threshold sets, (90, 45) and (180, 90), are used in the experiment discussed in the following section.
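The two-step pruning described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function name `prune_citation_graph`, the representation of the graph as (citing, cited) pairs, and the use of "greater than or equal to" for the first threshold are our own choices for the example.

```python
from collections import defaultdict


def prune_citation_graph(arcs, first_threshold, second_threshold):
    """Return the core set of nodes kept by a two-step pruning.

    arcs is a list of (citing, cited) pairs (an assumed representation).
    """
    in_degree = defaultdict(int)
    neighbours = defaultdict(set)
    for citing, cited in arcs:
        in_degree[cited] += 1
        # Step two looks at papers citing OR cited by a core paper,
        # so neighbours are stored without direction.
        neighbours[citing].add(cited)
        neighbours[cited].add(citing)

    # Step 1: drop every node whose in-degree is below the first threshold.
    core = {n for n, d in in_degree.items() if d >= first_threshold}

    # Step 2: grow the core with neighbours whose in-degree is higher than
    # the second threshold, until no new node can be added.
    pending = list(core)
    while pending:
        node = pending.pop()
        for other in neighbours[node]:
            if other not in core and in_degree[other] > second_threshold:
                core.add(other)
                pending.append(other)
    return core
```

With the threshold sets used in this paper, the call would look like `prune_citation_graph(arcs, 90, 45)` or `prune_citation_graph(arcs, 180, 90)`, where `arcs` stands for the citation arcs of the complete graph.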
3.2. Experiments
The citation graph derived by applying the threshold values of 90 and 45 is shown in Figure 1 below. Yellow nodes represent authoritative sources [10], red nodes stand for hubs, and green nodes play a dual role as both authority and hub. The size of a node signifies the relative importance of an authoritative article. There are 176 document nodes and 442 links in Figure 1. The 8 most authoritative papers in Figure 1 are listed in Table 1. There are four disjoint weak components in Figure 1; these components and their corresponding papers are listed in Table 2.
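As a rough illustration of how authority and hub roles of the kind cited from [10] can be scored, the following Python sketch runs the mutual-reinforcement iteration on a list of citation arcs. The function name `hits`, the fixed iteration count, and the (citing, cited) pair representation are assumptions made for this example; the weighting actually used to draw Figure 1 is not specified here and may differ.

```python
import math


def hits(arcs, iterations=50):
    """Return (authority, hub) score dicts for a list of (citing, cited) arcs."""
    nodes = {n for arc in arcs for n in arc}
    auth = dict.fromkeys(nodes, 1.0)
    hub = dict.fromkeys(nodes, 1.0)

    def normalise(scores):
        # Scale so that the squared scores sum to one (no-op if all zero).
        norm = math.sqrt(sum(v * v for v in scores.values()))
        return {n: (v / norm if norm else 0.0) for n, v in scores.items()}

    for _ in range(iterations):
        # A paper is a good authority if good hubs cite it.
        new_auth = dict.fromkeys(nodes, 0.0)
        for citing, cited in arcs:
            new_auth[cited] += hub[citing]
        # A paper is a good hub if it cites good authorities.
        new_hub = dict.fromkeys(nodes, 0.0)
        for citing, cited in arcs:
            new_hub[citing] += new_auth[cited]
        auth, hub = normalise(new_auth), normalise(new_hub)
    return auth, hub
```

A node with a high authority score and a low hub score would correspond to a yellow node in the color scheme above, and a node scoring high on both to a green one; where to draw those cut-offs is a separate design decision that this sketch does not address.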
Figure 1. Weak components in the weighted authoritative citation graph with 90/45 threshold values

Table 1. List of papers in the main component in Figure 1 with descending rank

Cited  No.  Global Cited  url=http://citeseer.ist.psu.edu/+  Title of Paper
27     7    878           agrawal93mining.html               Mining Association Rules between Sets of Items in Large Databases (1993)
19     6    869           agrawal94fast.html                 Fast Algorithms for Mining Association Rules (1994)
18     86   468           context/229/0                      An overview of the KL-ONE knowledge representation system (1985)
12     111  113           context/70319/0                    The tractability of subsumption in frame-based description languages (1984)
11     30   261           srikant95mining.html               Mining Generalized Association Rules (1995)
10     98   207           context/15167/0                    Attributive concept descriptions with complements (1991)
10     93   111           nebel90terminological.html         Terminological Reasoning is Inherently Intractable (1990)
9      48   185           context/24591/0                    Reasoning and Revision in Hybrid Representation Systems (1990)

Table 2. List of papers in the disjoint weak components in Figure 1

Component  Node  url=http://citeseer.ist.psu.edu/+  Title of Paper
1          35    sarkar92graphical.html             Graphical Fisheye Views of Graphs (1992)
1          36    context/40987/0                    Generalized Fisheye Views (1986)
1          37    sarkar93graphical.html             Graphical Fisheye Views (1993)
1          176   context/38196/0                    The perspective wall: Detail and context smoothly integrated (1991)
2          168   ahlberg94visual.html               Visual Information Seeking: Tight Coupling of Dynamic Query Filters with Starfield Displays (1994)
2          169   alhlberg92dynamic.html             Dynamic Queries for Information Exploration: An Implementation and Evaluation (1992)
2          170   context/17968/0                    Designing the User Interface: Strategies for Effective Human-Computer Interaction, Addison-Wesley Publishing Company, by Shneiderman, B. (1987)
3          78    clark89cn.html                     The CN2 Induction Algorithm (1989)
3          79    quinlan87simplifying.html          Simplifying Decision Trees (1986)
4          160   farquhar96ontolingua.html          The Ontolingua Server: a Tool for Collaborative Ontology Construction (1996)
4          161   context/17004/0                    A Translation Approach to Portable Ontology Specifications (1993)
Figure 2. Weak components in the weighted authoritative citation graph with 180/90 threshold values

The two most cited papers in Table 1 are papers in the data mining field that discuss effective algorithms for mining association rules from large databases. The papers ranked third and fourth are in the field of Artificial Intelligence (AI); they discuss the KL-ONE knowledge representation system and frame-based description languages, respectively. Three of the four remaining papers are AI-related articles; the remaining paper discusses mining association rules.
Three of the four papers listed under component one in Table 2 discuss graphical fisheye views. The articles in component two are all in fields related to visualization, including user interface design, visual information seeking, and the graphical visualization of databases. The papers in component three were published in Machine Learning and comparable journals; they are AI-related papers. The papers in component four represent recent developments in ontology-related studies.
The citation graph resulting from applying the threshold values of 180 and 90 is shown in Figure 2 above. There are 37 document nodes and 51 citation arcs in Figure 2. The three most authoritative and most cited papers, interestingly, are all papers in the data mining field that discuss effective algorithms for mining association rules from large databases. We therefore conclude that the most researched area related to knowledge visualization is data mining. There are four disjoint weakly connected components in Figure 2: (1) the component with nodes 33, 18, and 19; (2) the component with nodes 13 and 14; (3) the component with nodes 34 and 35; and (4) the component with nodes 36 and 37. Following Chen and Xie [8], these weak components may represent research focuses that diverge from the main research theme, mining association rules. Component one includes papers in the research field of artificial intelligence. The articles and books in component two discuss the visualization of information. The papers and books in component three discuss data clustering and pattern classification. Component four includes two papers discussing efficient multidimensional data aggregation, which is a very important issue for OLAP applications. As we can see, the papers in these four components belong to four distinct research fields that are all related to the study of knowledge visualization.
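To make the notion of a weak component concrete, the sketch below groups the nodes of a pruned graph into weakly connected components by ignoring arc direction. It is an illustrative assumption of how such components could be extracted; the name `weak_components` and the input format (a set of retained nodes plus a list of (citing, cited) arcs) are hypothetical and not taken from the system used in this study.

```python
def weak_components(nodes, arcs):
    """Return the weakly connected components among nodes, largest first."""
    # Build an undirected adjacency map restricted to the retained nodes.
    adjacency = {n: set() for n in nodes}
    for citing, cited in arcs:
        if citing in adjacency and cited in adjacency:
            adjacency[citing].add(cited)
            adjacency[cited].add(citing)

    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Depth-first traversal collects everything reachable from start.
        component = set()
        stack = [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for other in adjacency[node]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        components.append(component)
    return sorted(components, key=len, reverse=True)
```

Under this reading, the first element of the returned list plays the role of the main component, and the remaining elements correspond to the smaller disjoint components enumerated above.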
3.3. Document similarity analysis
Documents that belong to the same disjoint component deal with the same or a closely related research topic. Chen and Xie [8] proposed this intuitively appealing argument without further explanation. We bolster it by applying a document similarity function [11] to validate that content similarity implies proximity of research fields. We use the most cited document listed in Table 1 as the basis for comparison with the other documents listed in Tables 1 and 2. Documents are included in the comparison if they are available from CiteSeer. Component 0 refers to the papers listed in Table 1; the other component numbers correspond to those of Table 2. The result is listed in the table below:

Table 3. Similarity scores

Score       Component  Title of Paper
0.4595601   0          Mining Association Rules between Sets of Items in Large Databases
0.40999863  0          Fast Algorithms for Mining Association Rules
0.38664725  0          Mining Generalized Association Rules
0.20479295  3          Simplifying Decision Trees
0.13106923  0          Terminological Reasoning is Inherently Intractable
0.08890509  2          Dynamic Queries for Information Exploration: An Implementation and Evaluation
0.08768359  2          Visual Information Seeking: Tight Coupling of Dynamic Query Filters with Starfield Displays
0.082507    4          The Ontolingua Server: a Tool for Collaborative Ontology Construction
0.06768221  1          Graphical Fisheye Views
0.06295684  1          Graphical Fisheye Views of Graphs
The similarity scores are calculated by a home-grown similarity analysis system based on Lucene [12], which is an open source implementation of the TF×IDF similarity function. A score is calculated from four portions of a paper: title, abstract, keywords, and content. The closer the scores of two documents, the more similar the documents are. We can see from Table 3 that the documents are clustered by component according to their similarity scores, with the exception of one document in component three. Comparable results have been observed when a document from a different component is chosen as the basis for comparison.
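For readers unfamiliar with TF×IDF weighting, the following Python sketch shows the general idea with a cosine comparison of weighted term vectors. It is a simplified illustration only: it does not reproduce Lucene's scoring formula, and the whitespace tokenizer, the smoothed IDF term `log(1 + N/df)`, and the function names are assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Turn a list of text strings into TF x IDF weighted term vectors."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        term_freq = Counter(tokens)
        vectors.append({
            term: count * math.log(1 + n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

In a setup like the one described above, the title, abstract, keywords, and content of each paper would be concatenated (or weighted separately) before vectorization, and every document would be compared against the vector of the base document.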
4. Discussions
The high thresholds we have applied help us discover the most researched topics with a general query term. However, a newer research context may be filtered out by this process due to infrequent citation, since recently published literature generally receives fewer citations. Conversely, a mature research area may also be filtered out due to sporadic citation. We compared the results obtained with the two different sets of threshold values and summarize the findings as follows: (1) the main research theme seems stable and stays on data mining related issues; (2) mature research topics such as fisheye views were filtered out by the higher threshold values; (3) relatively newer research areas such as ontology were uncovered by the lower threshold values; and (4) evolving research areas such as visualization stayed on an evolving course.

To elaborate on these findings, we use fisheye view research as an illustrative example. A fisheye lens is a very wide angle lens that magnifies nearby objects while shrinking distant objects [13]. It is a helpful tool for seeing both “local detail” and “global context” simultaneously. The early research on fisheye views was adopted widely, and the technique is commonly used in the visualization field nowadays. The fisheye view has become a general tool and is no longer a research topic, which may explain why papers in this field have been cited less frequently in recent years and were filtered out by the higher threshold values.

The in-degree and out-degree distributions of the nodes in the citation graph follow a power-law distribution [14]. The threshold values could be set as a function of the power-law curve of the in-degree distribution of the citation graph [8]. For example, the first threshold is set as the value x1 where the cumulative probability distribution function of the power-law curve P(x1) = 0.25%. The second threshold should be set with a higher cumulative probability value, such as x2 where P(x2) = 1%. Currently, there is no definite guideline for choosing the threshold values. However, some general guidelines may be followed: (1) the difference between the threshold values in a set should be significant enough; and (2) sets with significantly different threshold values should be applied to enable the display of graphs with varying granularities.
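One way to read the threshold rule above is that P(x) denotes the share of nodes whose in-degree is at least x, so that a smaller share yields a higher threshold. Under that interpretation, which is our assumption, the sketch below picks a threshold directly from the observed in-degree values instead of from a fitted power-law curve. The function name `threshold_for_tail_fraction` is hypothetical.

```python
import math


def threshold_for_tail_fraction(in_degrees, fraction):
    """Return the in-degree x such that roughly `fraction` of nodes have in-degree >= x.

    Uses the empirical distribution of in_degrees (a non-empty list of
    integers) as a stand-in for the fitted power-law curve.
    """
    ranked = sorted(in_degrees, reverse=True)
    # Number of top-ranked nodes that make up the requested share (at least one).
    k = max(1, math.ceil(fraction * len(ranked)))
    return ranked[k - 1]
```

With this helper, a threshold set in the spirit of the example would be obtained as `(threshold_for_tail_fraction(d, 0.0025), threshold_for_tail_fraction(d, 0.01))`, where `d` is the list of in-degree values of the complete citation graph.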
5. Conclusions
The context and content of the study of knowledge visualization are not well understood because of the differing definitions and propositions of different scholars. The diverse views of this research field may be clarified by analyzing the vast citation information available in the field. The procedure we have applied uncovered the underpinning disciplines of the research field of knowledge visualization, which include knowledge representation, data mining, graphical user interfaces, information visualization, graph visualization, data clustering, and, more recently, studies in ontology. In addition, research that has fallen out of favor due to maturation or other reasons, as well as new research subjects, can be identified by their relatively infrequent citations. The procedure we have devised is generally applicable to clarifying research domains and their underlying supporting domains. Unlike the other approaches reviewed earlier, it has few prerequisites or limitations on its effective use.
References
[1] R. A. Burkhard, "Learning from Architects: The Difference between Knowledge Visualization and Information Visualization," in Proceedings of the Information Visualisation, Eighth International Conference on (IV'04) - Volume 00: IEEE Computer Society, 2004.
[2] K. M. Fairchild, S. E. Poltrock, and G. W. Furnas, "SemNet: Three-Dimensional Graphic Representations of Large Knowledge Bases," in Cognitive Science and its Applications for Human-Computer Interaction, R. Guindon, Ed. Hillsdale NJ: Lawrence Erlbaum Associates, 1988.
[3] K. Börner, C. Chen, and K. Boyack, "Visualizing Knowledge Domains," in Annual Review of Information Science and Technology, vol. 37, B. Cronin, Ed. Medford, New Jersey: American Society for Information Science and Technology, 2002, pp. 179-255.
[4] H. D. White and K. W. McCain, "Visualizing a Discipline: an Author Co-Citation Analysis of Information Science, 1972-1995," Journal of the American Society for Information Science (1986-1998), vol. 49, pp. 327, 1998.
[5] E. Garfield, "Historiographic Mapping of Knowledge Domains Literature," Journal of Information Science, vol. 30, pp. 119, 2004.
[6] C. Chen and R. J. Paul, "Visualizing a Knowledge Domain's Intellectual Structure," Computer, vol. 34, pp. 65-71, 2001.
[7] C. Chen, "Generalised Similarity Analysis and Pathfinder Network Scaling," Interacting with Computers, vol. 10, pp. 107-128, 1998.
[8] T. T. Chen and L. Q. Xie, "Identifying Critical Focuses in Research Domains," presented at Proceedings of the Information Visualisation, Ninth International Conference on (IV'05), London, 2005.
[9] K. D. Bollacker, S. Lawrence, and C. L. Giles, "CiteSeer: an Autonomous Web Agent for Automatic Retrieval and Identification of Interesting Publications," in Proceedings of the second international conference on Autonomous agents. Minneapolis, Minnesota, United States: ACM Press, 1998.
[10] J. M. Kleinberg, "Authoritative sources in a hyperlinked environment," Journal of The ACM, vol. 46, pp. 604-632, 1999.
[11] G. Salton, J. Allan, and C. Buckley, "Automatic structuring and retrieval of large text files," Commun. ACM, vol. 37, pp. 97-108, 1994.
[12] E. Hatcher and O. Gospodnetić, Lucene in Action. Greenwich: Manning Publications Co., 2004.
[13] M. Sarkar and M. H. Brown, "Graphical fisheye views of graphs," in Proceedings of the SIGCHI conference on Human factors in computing systems. Monterey, California, United States: ACM Press, 1992.
[14] Y. An, J. Janssen, and E. Milios, "Characterizing and Mining the Citation Graph of the Computer Science Literature," Dalhousie University, Technical Report CS-2001-02, 2001.