
Domain visualization for digital libraries - Information ... - IEEE Xplore

Domain Visualization for Digital Libraries

Chaomei Chen
The VIVID Research Centre
Department of Information Systems and Computing
Brunel University, Uxbridge UB8 3PH, UK
Email: [email protected]

0-7695-0743-3/00 $10.00 © 2000 IEEE

Abstract

Domain visualization aims to reveal the most significant intellectual structure associated with a subject domain. This article illustrates how domain visualization enables novel ways of accessing scientific literature in digital libraries. The domain of computer graphics is visualized based on a citation analysis of articles that appeared in the prestigious IEEE Computer Graphics and Applications over 18 years (1982-1999). The derived high-dimensional domain structure is uncovered through the use of various visualization and animation techniques. Author co-citation maps open up new ways of visual information retrieval. Not only can users explore domain visualizations as a virtual landscape, but they can also invoke queries directly to a digital library. In this article, the domain visualization is linked to various citation-related information about scientific literature on the Web through NEC's ResearchIndex system.

1. Introduction

Understanding and tracking the development of scientific knowledge is a crucial part of scientific activity. It is increasingly challenging for students, researchers, and even domain experts to keep track of the vast amount of scientific publications in the information age. A wide variety of users need tools that can help them to synthesize relevant scientific literature as a whole as well as to retrieve specific publications effectively. In this article, we describe our research in domain visualization, which aims to provide users with a wider range of ways to explore scientific literature and make sense of long-term trends in the development of scientific knowledge.

Traditional information retrieval typically uses a query of keywords as a starting point. One can refine queries subsequently based on the intermediate results of retrieval. It is well known that keyword-based query formation is very vulnerable, relying on the knowledge of the subject in question, experience of search, and the familiarity of the search space. The so-called vocabulary mismatch problem refers to the ambiguity of term selections and the meanings assigned to the same word by people with various backgrounds. Therefore, it would be particularly helpful in practice if users could have a wider range of options in forming queries of different nature. For example, it is common to see search functions designed to allow users to specify the search scope in terms of languages, media, years of publication, and other extrinsic attributes.

In this article, a major motivation of the work is to explore the use of domain visualization as an alternative means of accessing intellectual structures as well as the scientific literature of a given subject domain. Domain visualization relies on various analyses and models of a subject domain, drawing upon communicative structures, trends, and patterns identified in quantitative and qualitative data. Domain visualization aims to reveal the most significant and profound intellectual structure associated with scientific research in a subject domain. The pioneering work includes the Atlas of Science [1] and more recent work [2, 3] at the Institute for Scientific Information (ISI), and author co-citation analysis of information science [4] and hypertext [5]. This article describes domain visualization based on author co-citation analysis of data from the Science Citation Index. This domain visualization not only enables researchers, scientists, and learners to explore and keep track of the development of knowledge in a specific domain, but also provides an interface that facilitates the search of scientific literature on the Web.

This article is organized as follows. First, we introduce related work in domain visualization and digital libraries of scientific literature. Second, we describe how the work extends our previous research in author co-citation analysis, especially in revealing the structure of a high-dimensional space of specialties. Then, we illustrate how this domain visualization can be linked to an independent citation search engine, developed by NEC Research Institute, to enable users to search scientific literature from different perspectives. Implications for digital libraries are discussed.

2. Related Work

The notion of association is a broad and widely used concept in digital libraries, information retrieval, and hypertext. For example, the strength of association between two documents is traditionally determined based on patterns derived from word frequencies. One can define other types of association, such as co-citation and state transition models [5, 6]. A central idea in the legendary Memex is trailblazing, i.e. making connections, in a vast information space [7]. Bibliographic citations play a similar role in shaping the scientific literature and to some extent the scientific community itself. In order to improve the integrity and accessibility of scientific literature, technologies have been developed for tracking the growth of scientific disciplines. Two remarkable examples are classic citation indexing and autonomous citation indexing. Classic citation indexing is well established through the long-standing efforts of ISI, which produces the Science Citation Index (SCI) and the Social Science Citation Index (SSCI). Autonomous citation indexing, developed by researchers at NEC, is a promising Web-based approach.

2.1 Classic Citation Indexing

Originally, citation indexing was motivated by the need to break the barrier in subject indexing. By drawing upon the profound practice of citation in scientific literature, citation indexing is intended to reveal the underlying, intrinsic structure of the body of scientific knowledge. ISI has been devoted to representing the structure of scientific literature for many years. Its pioneering Atlas of Science [1], based on document co-citation analysis, revealed the macro-structure of disciplines such as biochemistry and molecular chemistry. Recently, ISI has become increasingly interested in the applicability of visualization technologies for mapping science as a whole [2].

2.2 Autonomous Citation Indexing

Recently, researchers at NEC Research Institute have developed a Web-based citation database system called ResearchIndex (formerly known as CiteSeer), which allows users to search for various citation details of scientific documents on the Web [8]. ResearchIndex currently supports an experimental database for computer science, which already contains hundreds of thousands of records. Not only can users search the database on various bibliographic attributes of a document, such as the author, article title, and journal title, but they can also use the citation context function to retrieve excerpts from citing documents and find out the context of individual citations. This is an invaluable tool for researchers to judge the nature of a citation.

In order to enable users to search citation databases more actively, one must develop various query-formation aids to help users take advantage of being able to form a multiple-facet query. However, to fully benefit from the facility of searching scientific literature on the Web using the name of an author as a starting point, one must have additional in-depth knowledge of researchers in a particular subject domain. Currently, such information is not yet readily available.

2.3 Author Co-Citation Analysis

Author co-citation analysis (ACA) aims to discover intellectual structures from author co-citation data [4]. In ACA, instead of articles or journals, individual authors are used as data points in the literature. ACA uncovers how authors, as domain experts, perceive the interconnectivity between published works. An in-depth author co-citation analysis was reported by White and McCain [4] in 1998. They analyzed the domain of information science based on author co-citation data drawn from 12 key journals in the field. Their work demonstrates the strength and potential of ACA, although it is still a time-consuming process. Inspired by this landmark work, we have been exploiting the potential of ACA in strengthening and augmenting domain visualization. In this article, we will illustrate how an intellectual structure of a subject domain can be derived with ACA. In addition, we will show how co-citation maps of authors provide a new way of organizing and accessing digital libraries.

3. Domain Visualization

In this article, domain visualization is presented through the analysis and modeling of the field of computer graphics. All the articles that appeared in one of the core computer graphics journals, IEEE Computer Graphics and Applications (CG&A), are included in the analysis. On the one hand, the present study aims to reveal the intellectual structure of information visualization. On the other hand, the study is designed as a starting point of a longitudinal process of tracking the development of the domain knowledge. Key journals and conference proceedings identified in this study, such as ACM Transactions on Graphics, will be incrementally added to our database in subsequent studies.

3.1 Data Collection

The bibliographic data were drawn from the SCI database. IEEE CG&A was launched in 1981. The SCI database covers articles published in the journal over the past 18 years, from the second volume in 1982 through the current volume in 1999. Other types of publications in the journal, such as editorials and reports, were excluded from the analysis. The sample citation data refer to a total of 10,292 unique articles together with the names of 5,312 unique first authors. The data revealed a total of 1,820 citing authors. A total of 353 authors who had been cited more than 5 times in CG&A were selected. A domain landscape was subsequently derived from the co-citation profiles of these 353 top-sliced authors, as if the 1,820 CG&A authors had rated the proximity of each pair of authors through their citations.
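The selection and counting steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each citing article is represented as a list of cited first-author names, and the function names `select_authors` and `cocitation_counts` are hypothetical, not taken from the study. Two authors are counted as co-cited once for every citing article whose reference list contains both.

```python
from collections import Counter
from itertools import combinations


def select_authors(reference_lists, threshold=5):
    """Return authors cited more than `threshold` times over all citing articles."""
    counts = Counter(author for refs in reference_lists for author in refs)
    return sorted(a for a, c in counts.items() if c > threshold)


def cocitation_counts(reference_lists, authors):
    """Count, for each pair of selected authors, the articles citing both."""
    keep = set(authors)
    pairs = Counter()
    for refs in reference_lists:
        # Distinct selected authors cited by this article, in a fixed order
        cited = sorted(keep.intersection(refs))
        for a, b in combinations(cited, 2):
            pairs[(a, b)] += 1
    return pairs


# Toy data (made up for illustration): three citing articles
toy_refs = [
    ["Whitted", "Blinn", "Phong"],
    ["Whitted", "Blinn"],
    ["Whitted", "Mandelbrot"],
]
toy_authors = select_authors(toy_refs, threshold=1)
toy_pairs = cocitation_counts(toy_refs, toy_authors)
```

With the study's threshold of 5, the same two calls would yield the 353 selected authors and the pair counts from which a 353-by-353 co-citation matrix can be filled.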

3.2 Identifying Specialties

Factor analysis was based on the co-citation profiles of 353 authors. A principal component analysis (PCA) was applied to the 353-by-353 matrix of authors in order to identify salient dimensions that can characterize these authors. In information science, these dimensions are called specialties, which represent sub-fields of a subject domain.

In order to determine the nature of individual specialties, we examined the context of citations associated with authors who have predominant profiles in each specialty. We used ResearchIndex to retrieve the context of citations for authors with the highest factor loading in each factor. For example, because Whitted had the highest factor loading in the first factor, we searched for his name in ResearchIndex. His most frequently cited work on the Web was his CACM article in 1980, which has been cited more than 50 times on the Web. This echoed its profile in IEEE CG&A. We examined a series of citations to this article and determined the nature of this author's contribution. The profiles of leading authors in each specialty were established in this way. Each specialty was named after the most common characteristics identified in the citation context. The following excerpt from ResearchIndex illustrates this process for Whitted's article:

Whitted, T. An improved illumination model for shaded display. Comm. ACM 23 (1980), 343-349.

Ray Tracing with Meta-Hierarchies. James Arvo, Apollo Systems Division of Hewlett-Packard, 300 Apollo Drive, Chelmsford, MA 01824. [email protected], [email protected]

3.3 Mapping the Knowledge Domain

Author co-citation networks were submitted to Pathfinder network scaling in order to simplify and highlight the most essential relationships of the subject domain. We embedded the Pathfinder networks into a 3D virtual world and then superimposed additional information so as to present a domain landscape to users. We included a time series of citations as colored stacked bars and depicted the results of factor analysis using factor-loading-based color encoding. Users can zoom in and out in the virtual domain landscape to explore its structure and content.

The 3D landscape reinforces the semantics of the author co-citation network. For example, authors located towards the center of the network are likely to have made a profound impact on the field, whereas authors located in areas remote from the center are more likely to be known for their special and unique works. On top of the author co-citation network, the citation time series of each individual author over the 18-year span was color-mapped to a stacked bar. The length of each section of the stacked bar, displayed in a unique color, corresponds to the frequency of citations of the author in a particular year. Citations in recent years are in bright yellow, and citations in earlier years are in darker colors. By looking at the color towards the top of an author's citation bar, one can tell whether this author's work is currently at its peak or whether its peak has passed. Domain analysts will be able to identify rising stars, those authors with an extended range of bright colors in their citation bars, or falling stars, whose citation bars are largely covered by dark colors. In addition, such domain landscapes allow users to directly access bibliographic and full-text digital libraries.
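To make the role of Pathfinder network scaling in this section more concrete, the following Python sketch prunes a small similarity matrix. It assumes the commonly used parameter setting r = infinity and q = n - 1, which the text does not state, and it assumes that co-citation counts are turned into distances by taking reciprocals; the function name `pathfinder_edges` is hypothetical. Under these assumptions a direct link survives only if no indirect path has a smaller largest step.

```python
import math


def pathfinder_edges(similarity):
    """Return the links kept by PFNET(r=inf, q=n-1) for a symmetric similarity matrix."""
    n = len(similarity)
    # Assumed conversion: distance = 1 / similarity; zero similarity means no link
    dist = [[0.0 if i == j
             else (1.0 / similarity[i][j] if similarity[i][j] > 0 else math.inf)
             for j in range(n)] for i in range(n)]
    # Minimax path distances: with r = infinity, a path costs as much as its largest step
    minimax = [row[:] for row in dist]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = max(minimax[i][k], minimax[k][j])
                if via_k < minimax[i][j]:
                    minimax[i][j] = via_k
    # Keep a direct link only if no indirect path beats it
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if dist[i][j] < math.inf and dist[i][j] <= minimax[i][j]]


# Toy co-citation counts (made up): authors 0 and 2 are only weakly co-cited
toy_similarity = [
    [0, 5, 1],
    [5, 0, 4],
    [1, 4, 0],
]
toy_links = pathfinder_edges(toy_similarity)
```

In the toy matrix the weak link between authors 0 and 2 is dropped because the path through author 1 consists of two stronger links, which is the kind of simplification that leaves only the most salient relationships in the map.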

4. Results

4.1 Author Co-Citation Analysis

Factor analysis identified 60 sub-fields of the domain according to the 353 authors' citation profiles. The two most predominant factors explained 13% and 11% of the variance, respectively, whereas the top 5 factors accounted for 39% of the variance. In order to determine the nature of the major factors, we listed the names of the 10 most predominant authors for each of the top 5 factors (see Table 1). We then examined the citation profiles of these authors to establish why they were cited.

Table 1. Factor membership. Top 10 authors ranked by factor loading on the five most predominant factors, which explain 39.1% of the variance in the author co-citation profiles.
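The variance percentages and factor loadings reported here come from a principal component analysis of the author co-citation matrix. The sketch below shows how such quantities can be computed with NumPy. It is a minimal illustration rather than the procedure used in the study: it works on the correlation matrix of the author columns, applies no factor rotation, and the function name `principal_components` is hypothetical.

```python
import numpy as np


def principal_components(profiles):
    """Return explained-variance ratios and loadings for an authors-by-authors matrix."""
    corr = np.corrcoef(profiles, rowvar=False)   # correlations between author columns
    eigvals, eigvecs = np.linalg.eigh(corr)      # eigenvalues come back in ascending order
    order = np.argsort(eigvals)[::-1]            # reorder from largest to smallest
    eigvals = np.clip(eigvals[order], 0.0, None) # remove tiny negative rounding errors
    eigvecs = eigvecs[:, order]
    explained = eigvals / eigvals.sum()          # share of variance per component
    loadings = eigvecs * np.sqrt(eigvals)        # rows: authors, columns: components
    return explained, loadings
```

Given a 353-by-353 co-citation matrix, `explained[:5].sum()` would correspond to the share of variance attributed to the top five factors, and sorting the first column of `loadings` by absolute value would give a ranking comparable to a factor membership list.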

4.2 Specialties in Context

The nature of the five most predominant specialties was identified according to citation contexts for leading members of each specialty.

4.2.1. Rendering and ray tracing. The first specialty is dominated by several well-known computer graphics scientists in classic rendering techniques, including Whitted's illumination model for ray tracing, Williams' classification of level of details, the Cook-Torrance lighting model, and the famous Phong shading model. Whitted's original article in 1980 is ranked as the second most cited work, with a total of 30 citations in the IEEE CG&A data. One of the most cited authors in IEEE CG&A, James Blinn, also appeared in this group. He has received a total of 91 citations, ranking third. Blinn's seminal work on implicit surface modeling is known through his popular blobby model. This factor, therefore, indicates a specialty that includes pioneering works in rendering and computer-generated images.

4.2.2. Computer vision. The second specialty is labeled as computer vision. This group includes a number of pioneering researchers in computer vision; representative work includes Tilove's set membership classification for intersection problems and for detecting collision paths. A search on ResearchIndex revealed a total of 263 citations of Ballard and Brown's book, Computer Vision. These authors' works represent some typical research in computer vision.

4.2.3. Geometric modeling and computer-aided design. Several names in the third largest specialty are related to the concept of splines. Splines are curve and surface representations defined with piecewise polynomial functions. This group is dominated by the presence of computer-aided design and geometric modeling. Bezier appears in the top 5 of this group; his name has been associated with a number of fundamental concepts in geometric modeling, including the famous Bernstein-Bezier patches and Bezier clipping. The Bernstein-Bezier representation has been extremely useful in characterizing multivariate spline bases. Bezier clipping is an iterative clipping method that takes advantage of the convex hull property of Bezier curves and clips away regions of the curve that do not intersect with the surface. Bezier surfaces defined over a triangular domain are often called Bezier triangles. It was Sabin who first generalized Bezier to triangular B-splines. Sabin is at the top of the membership list according to factor loading on this factor.

4.2.4. Image processing. Ray tracing and direct projection methods are the basics for generating directly rendered images. In ray tracing, viewing rays are sent through each pixel and integrated through the volume. In direct projection, each cell of the volume is projected onto the screen. In contrast to the ray-tracing specialty identified earlier, this image processing specialty corresponds to a group of researchers often associated with direct projection. The first two names ranked in this factor, Dan Gordon and Gideon Frieder, were frequently cited for their work in direct projection. This specialty also includes researchers in image processing, especially in processing brain images and contour interpolation. Interpolation between contours is a general problem in computer graphics. This problem has been dealt with by algorithms that address the correct mapping or correlation of contour points at one level with those at an adjoining level. This specialty is therefore characterized as image processing.

4.2.5. Modeling nature. The fifth largest specialty is related to the modeling of nature, including human figures, facial expressions, flocking behaviors of birds, coastlines, and mountains. For example, the Jack human-body animation system, developed at the University of Pennsylvania's Center for Human Modeling and Simulation, is a landmark work in computer graphics and especially in the simulation and animation of realistic movements of human figures. Parke's parametrized facial modeling approach is also a predominant player in this specialty group. This specialty group also includes Reynolds' pioneering work in simulating flocking behavior in birds. Research in related areas has frequently referred to this model. Fractals, a term coined by Mandelbrot, have been used to describe spiky, irregular, or variegated objects, such as coastlines, mountains, and crystals. A common example is the computation of the length of a shoreline. The shoreline becomes longer and longer as the resolution of the map increases because one must account for every newly visible creek at higher resolutions. Zipf's law, which is also explained in Mandelbrot's book, states that the degree of popularity is inversely proportional to the rank of popularity. A number of citations to Mandelbrot's work applied this law to the modeling of behavioral patterns such as accessing documents on the Web. The relatively predominant position of fractals in factor 5 suggests that this specialty shares some common interests in the modeling of nature, especially human figures, facial expressions, animal behavior, coastlines, and mountains.

4.3 Mapping the Knowledge Domain

4.4 Author Co-Citation Network

First, we used the factor loadings of the three most significant factors to color the author co-citation network (see Figure 1, frame 0). The central area is in red, which corresponds to the most predominant factor. The green area to the left indicates the second most predominant factor of the field. The blue area to the right represents the third factor. Note that the entire network has not been completely covered by colored areas, suggesting that authors outside the colored areas must be characterized by higher-dimensional factors.

4.5 Animation of High-Dimensional Specialties

In order to reveal the high-dimensional nature of the field, authors in each specialty were highlighted in turn in a sequence of animated frames (see Figure 1, frames 1-5). Areas associated with the current specialty become brighter. The glowing area helps users to identify the position of a specialty in the global context of the entire network.

Figure 1. An animation sequence in which the areas of specialties glow one after another.
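The color encoding and the frame sequence shown in Figure 1 can be approximated with a short Python sketch. It assumes that the absolute loadings on the three leading factors are mapped linearly to the red, green, and blue channels, and that an author glows in a frame when the absolute loading on that frame's factor reaches a cut-off of 0.5. The mapping, the cut-off, the function names, and the loading values in the example are all illustrative assumptions, not details given in the article.

```python
def loading_color(loadings):
    """Map loadings on the three leading factors to an (r, g, b) tuple in 0..255."""
    strengths = [min(abs(x), 1.0) for x in loadings[:3]]
    strengths += [0.0] * (3 - len(strengths))  # pad if fewer than three factors
    return tuple(int(round(255 * s)) for s in strengths)


def animation_frames(author_loadings, n_factors, threshold=0.5):
    """Return one set of highlighted authors per factor, in factor order."""
    frames = []
    for f in range(n_factors):
        glowing = {name for name, loads in author_loadings.items()
                   if f < len(loads) and abs(loads[f]) >= threshold}
        frames.append(glowing)
    return frames


# Made-up loadings for two authors on four factors
toy_loadings = {
    "Whitted-T": [1.0, 0.0, 0.0, 0.0],
    "Sabin-M": [0.0, 0.2, 0.8, 0.0],
}
toy_colors = {name: loading_color(loads) for name, loads in toy_loadings.items()}
toy_frames = animation_frames(toy_loadings, n_factors=4)
```

An author with weak loadings on all three leading factors comes out nearly black under this mapping, which mirrors the observation that parts of the network remain uncolored until later frames highlight further specialties.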


4.6 Domain Landscape

Figure 2 is a screenshot of the domain landscape resulting from the analysis of IEEE CG&A, together with two models generated earlier from the ACM Hypertext conference proceedings and the ACM CHI conference proceedings. The landscape model integrates several significant structures, such as author co-citation networks, citation time series, and high-dimensional specialties. The 3D landscape invites users to explore trends and peaks of citations, clusters of authors, and shortest paths connecting two different areas.

Figure 2. Domain visualization of computer graphics based on IEEE CG&A.

Figure 3 shows two elevation models of the specialty space. The upper model is colored by the factor loading of each author. The lower one is colored by the total number of citations received. The highlighted pair of points in the two models corresponds to an author in the first specialty.

Figure 3. Elevation models of a specialty space.

Once we have an author co-citation map of a subject domain, it opens a variety of potentially useful ways to organize scientific literature. For example, one can organize a digital library such that one can browse through the network of authors and retrieve articles by the same author (see Figure 4). In addition, because the author co-citation map is derived from a trustworthy source, IEEE CG&A, it provides a reasonable reference framework to approximate the underlying intellectual structure reflected by scientific literature on the Web. Domain visualization enables users to search ResearchIndex, a citation-oriented search engine, more intuitively (Figure 5). Users can select an author directly from the author co-citation map, or in fact any author with the same specialty color, and pass the name over to ResearchIndex as a dynamic query. Users can further explore the work by the same author as well as the views of others in the citation context.

Figure 4. Click on "Whitted-T" in the map and examine a list of articles by the same author.

Figure 5. The author co-citation map enables users to search ResearchIndex by author and by specialty.


5. Discussions

Domain visualization driven by author co-citation analysis not only provides a medium for users to explore the intellectual structure of a subject domain, but also offers new ways of organizing and accessing scientific literature in digital libraries and on the Web. Domain analysts can follow hyperlinks from authors in a domain visualization model to the corresponding citation contexts in ResearchIndex and examine first-hand information concerning the context of citations. A domain visualization model can also provide users with an information retrieval gateway so that one can access scientific publications via a network of authors. Citation analysts can study both citation and co-citation patterns directly within the visualization model. For new researchers, the persistent domain visualization model offers a starting point from which to jump-start their own mental image of the field, or a reference framework within which to form their own understanding of the knowledge domain.

IEEE CG&A, as a prestigious journal, reflects significant aspects of computer graphics. Of course, it is not the only journal in the field. There is a vast number of publications in the literature on this subject. In order to gain a comprehensive picture of the subject domain, one must consider other influential sources, such as journals and conference proceedings. As these relationships within a discipline become clearer, one can expect that mapping scientific disciplines will result in useful roadmaps to guide researchers to locate expertise as well as individual publications. The practical use of the approach has implications for a variety of research fields, including digital libraries, information visualization, citation analysis, and domain analysis.

7. Acknowledgements

The author would like to acknowledge the funding from the Library and Information Commission and the British research council
