Information Searching Preferences and Practices of Computer Science Researchers Sally Jo Cunningham Department of Computer Science University of Waikato Hamilton, New Zealand Voice: 64-7-838-4402 Fax: 64-7-838-4155 E-mail:
[email protected]
Lynn Silipigni Connaway Library & Information Services Dept. University College University of Denver Denver, CO 80208 Voice: 303-871-3352 Fax: 303-871-4877 E-mail:
[email protected]
Abstract

We present preliminary findings of an ongoing study of the ways that computer scientists seek, use, and store information when conducting research. Their preferred methods of information foraging have implications for the design of information retrieval systems for these researchers. Traditional indexing schemes based on controlled vocabularies see little use. Researchers rely heavily on browsing and citation searches, information gathering techniques that are not well supported by existing indexes and retrieval systems. Not surprisingly, resources that can be immediately accessed from the user’s office (particularly via the Internet) are preferred to those requiring a special trip to another location (such as a university library).
1. Introduction

When searching for information, the computer science researcher has a large number of sources to explore: general science resources (such as the Science Citation Index); bibliographies limited to CS and related fields (such as the ACM Guide to Computing Literature and Inspec); Internet-accessible document repositories (for example, the New Zealand Digital Library [12] and the Unified CS Technical Report Index [11]); communication tools, such as email discussion lists and USENET News groups; and abstracting and reviewing services (such as Computing Abstracts and Computing Reviews). Given the amount of effort and resources required to build and maintain these information services, surprisingly little is known about how scientists in general, and computer scientists in particular, seek and use information. Most previous work in information retrieval has approached the issue of information retrieval system usability and usefulness by evaluating the interfaces of existing IR systems (for a comprehensive survey of this work, see [7]). In this paper, we approach the problem from the opposite perspective: given a group of information users (here, computing researchers), how do they prefer to gather and organize the documents they use in their work? By examining current practices in information gathering, we can learn how to better tailor information retrieval systems to the needs of these users. We have based our work on earlier studies of humanities scholars [10]. Like this earlier work, we are conducting a qualitative, exploratory study, utilizing a small number of face-to-face interviews.

2. Methodology
We conducted semi-structured, face-to-face interviews of 19 computer science researchers at two universities in New Zealand. Interviews lasted between thirty minutes and one hour, and were taped for later analysis. The questions were intended to provoke descriptions of: the training that the researcher received in searching the CS research literature; the types of information collections and resources that the researcher personally used, or was aware of; the researcher’s preferred searching techniques; and the method used to organize retrieved documents for later use (see Appendix). Summary demographic information on the study subjects is presented below. While the sample size is too small to support statistical analysis of responses as grouped by these demographic values, this information does indicate that our
subjects include faculty members at all stages in their careers. Time spent as a faculty member was used as an approximation of the subjects’ span as an active researcher, and ranged from six weeks to 26 years. While all participants currently work in New Zealand, their academic backgrounds are diverse (receiving their research training in New Zealand, USA, UK, Europe, and Australia). The range of specialties is similarly large, including both hardware and software research, and both “soft” fields (such as computing education and cognitive science) and “hard” (such as formal methods and computing architectures). The majority of our subjects were male, mirroring the gender imbalance typical of computer science departments.

    gender:                    male 17, female 2
    highest academic degree:   PhD 14, Master’s 4, Bachelor’s 1
    academic title:            lecturer 12, snr. lecturer 5, prof. 2
    years as a faculty member: 11 7
Interview responses

A number of common themes emerged from the interviews:

• bibliographic training: none of the subjects reported receiving formal training in conducting a literature search or in using the common indices. The typical introduction to performing a literature survey was informal, with their graduate supervisor providing an initial set of articles and a few hints on locating further items of interest. Most researchers were essentially self-taught in using information resources during graduate school, with some help in searching coming from a brief “using the library” seminar (two respondents) or informal advice from fellow graduate students. Subjects with graduate students uniformly reported that they gave their students little guidance in conducting searches, and expected the students to pick up searching skills on their own. Many of the researchers were apologetic about their self-perceived low skills in searching, or lack of knowledge about CS indexes. On the other hand, all subjects were practising and publishing researchers, who would presumably immediately rectify a problem that seriously hampered their ability to perform their work. While they may feel that they are sometimes not performing “correct” or “perfect” searches, they undoubtedly find enough useful material to satisfy themselves and their reviewers/readers.

• preferred searching resources: Formal indices such as the Science Citation Index were never or rarely consulted. Indeed, none of the participants reported using even infrequently two of the “standard” CS bibliographic resources (the ACM Guide to Computing Literature and the CompuMath Citation Index), and only two participants had consulted the FirstSearch databases more than once. Most participants reported using the Internet in searching for or acquiring research articles: by searching WWW repositories of CS documents, or by using general WWW search engines.

• use of the local library collection: The university library collections were rarely browsed, even to examine recent issues of journals the researchers felt were central to their field. Members of one of the two departments studied felt that their library resources were completely inadequate, and reported relying on private journal subscriptions (their own, and colleagues’; these journals are kept in faculty offices, and essentially form a private departmental library). This reluctance to use the library and preference for personal or informal collections mirrors similar findings for engineers [8]. Interloan was used to obtain articles only if they were not available (or could not be located) on the WWW.

• preferred searching techniques: None of the subjects reported using controlled vocabularies (such as the ACM subject headings). Instead, they preferred: to perform keyword searches (primarily of the WWW, and less frequently of the local library online catalog and CD-ROMs such as the Science Citation Index); to browse for articles of interest (through physical copies of journals, and homepages of researchers and research institutions that they felt were likely to produce high quality work in their area); to follow reference trails (by locating articles cited in papers that they had previously obtained); and to rely on their “invisible college” (receiving articles of interest from colleagues, or being informed of recent research in email discussion lists and newsgroups). When asked specifically about browsing, several reported browsing bookshelves in technical book stores and university libraries when travelling, and most described browsing nearby bookshelves or journal issues when physically retrieving a document from the library. One mentioned browsing on the WWW, and felt that one of the appeals of the WWW is that searching is "opportunistic": "when searching for something else, I find something interesting." Cronin and Hert [4] discuss the "information foraging" model of searching, in which researchers deliberately browse for new ideas. These CS researchers report that they rarely browse to come up with new research ideas, although several had done so in the past, in earlier stages of their careers; it seems that as the researchers have developed specialties they rely less on serendipity for new ideas. Most researchers emphasized the importance of knowing the primary journals in one's area, and browsing those journals regularly.

• preferred document type: Books were seen as likely to be out of date, and useful more as general subject references than as primary sources of information. Most researchers stated that, given a choice, they would prefer to use/reference a journal article over an article from a conference proceedings, and would prefer a conference article over a technical report. The importance of refereeing for quality control emerged frequently: for example, “I do not have much trust in the sort of papers that are included in conference proceedings.” In apparent contradiction, however, many of the subjects reported searching WWW technical report archives, or retrieving conference publications from the WWW. In these cases, the subjects stressed the importance of obtaining the most recent information, and observed that the refereeing/publishing process can be so lengthy as to render articles obsolete: "Almost everything I read is from technical report sites on the Web and I am reading less and less from journals...I know that formal, refereed journals have a place, but I get more of the technical reports; they're more up to date." Another mentioned that "top conferences are my first port of call. Journal articles are more comprehensive, but very out of date. It's not unusual to have three years from submission to publication, at which point you're a whole generation of grad students out of date."

• preferred document age: Computing is a rapidly changing field, with a relatively high obsolescence rate for research documents [5]. Not surprisingly, the majority of the subjects reported that they would generally expect useful documents to be less than five years old, and that the most immediately relevant articles would probably also be the most recently published. However, they also stated that they would use a document of any age if it were relevant.

• search intermediation: Only one of the subjects had asked a librarian to perform a search for him. Searching for documents is so tightly bound into the research process that using a search intermediary seems unproductive: "I can't believe that anyone other than me could guess what would be relevant." All reported doing the majority of their literature searching themselves, although they might also rely on the searches of a colleague (for a collaborative paper) or a graduate student.

• extensiveness of search: Building a comprehensive bibliography is not seen as a high priority. The researchers stressed the importance of having a field of expertise, and getting to know the key players and seminal works. Their own contributions would generally be heavily based on one or a few documents, probably published “recently”. Amassing a large set of references may even be counter-productive: "I know people who know the literature too well and never get any research done. They think everything has been done. I think [a too comprehensive survey] stifles one's creativity. The referees will tell me if I have missed some important reference."

Once retrieved, a potentially relevant document is characterized according to its potential usefulness, stored for later re-use, and sometimes shared with another researcher:

• organization of research materials: Documents retrieved in electronic format were uniformly printed, and in most cases only the printed document was retained (computer files were reported to be too large to keep indefinitely, although they might be stored while they were of immediate use). In organizing the paper copies, some subjects maintained elaborate filing systems (with documents arranged by a self-assigned cataloging number, sorted alphabetically by first author’s last name, or grouped according to subject or project); others relied on stacks of papers loosely sorted by project and subject in a highly idiosyncratic manner: "piles, generally not by project...boxes of reprints, folders of notes, folders in the filing cabinet, pigeonholes with papers and drafts for recent papers...basically it's chaos. Usually I manage to find the most important stuff."

• collaborative work: All but one of the subjects reported performing most research collaboratively (whether with a colleague or student). Collaboration is an accepted, and indeed expected, part of doing computing research: "I think it is very important to collaborate on research. I believe people work better together." A "willingness to work with other people" is felt to be important, and co-authorship is "the way to do things." Collaborators will of course be familiar with and share some of the same documents (or they would find it difficult to communicate, much less collaborate!), but all authors of a paper do not necessarily acquire or read all of the references cited in their work.

• duration of projects: Some pieces of research are short term or "one-off" papers/projects. More usually, the researcher will have a long term project or research focus that is expressed in terms of a series of conference and journal papers. In this case, the set of references for a particular piece of writing will most likely have been selected from an already existing collection that the researcher has amassed for earlier work, perhaps augmented by a few new references needed to substantiate novel portions of the new work. In other words, each new article produced does not necessarily require a significant effort in terms of searching for references.
3. Discussion

A primary concern of the researchers is the perceived ease of use of a proposed document index or repository: not surprisingly, computing researchers prefer to use computers and software to search whenever possible, and dislike using paper indices. They are much more likely to use an information resource if it is available electronically from their own offices; searching can then be more readily integrated into the research itself, and does not require a special trip to the library. Indeed, two of the respondents who consult the FirstSearch databases reported that they only began using the system when it became accessible by telnet from their personal accounts, and had not used it when FirstSearch was only available from library terminals. On a related note, computing researchers tend to be impatient with old-style command line, menu-driven systems (unfortunately, currently the norm for commercial bibliographic databases). Those who had used the CD-ROMs (such as the Science Citation Index), FirstSearch, or CARL’s UnCover disliked the old-fashioned, “clunky” feel of their interfaces: "I’m waiting [to use them] until they have a decent interface"; "standard web tools are quicker, easier, and lead to stuff you can get immediately". The fact that many documents located over the WWW are also available electronically was seen as a significant advantage: “Almost everything I want is sitting there and it takes two minutes to get it." Conventional indices and catalogs, by contrast, generally provide only bibliographic information, and any documents located will likely have to be interloaned, adding about three weeks to the time needed to physically acquire the document. Moreover, the subjects complained that bibliographic information alone is often not sufficient to determine whether or not a document is relevant to their needs (the current commercial CS resources generally do not include abstracts). Documents found on the WWW, however, can be immediately viewed or printed to determine their potential relevance.

Keyword, rather than controlled vocabulary, searches were the norm; the extensive efforts of librarians to provide subject thesauri and indices seem to have been largely wasted. The fact that the researchers received little formal instruction in searching seemed to be significant here, as the effective use of controlled vocabulary requires training. While it would be easy to advise that CS researchers should learn to do searches “right” (that is, to utilize tools that have been developed by information specialists), evidently the information searches that are performed have acceptable success rates and are not unduly time-consuming. From a utilitarian point of view, then, the keyword search method is “right” for these researchers, and elaborate or rigorous controlled vocabulary schemes are not required in this domain, by these users.

The majority of the researchers did, however, report regularly using several forms of subject-based browsing: scanning subject groupings in Computing Reviews, browsing the table of contents for journals known to be of interest, and scanning library shelves near retrieved books or journals. Further, several reported extensively browsing small, WWW-accessible document collections that focussed specifically on their research area. It appears that some subject classification support is useful, in terms of providing rough, large groupings of like documents to be browsed (rather than small, highly specific classifications to enhance search precision). Additional types of browsing used frequently are browsing by author (checking an individual’s home page, for example), and by institution (scanning a research group’s list of publications or ftp site). These types of searches are well-supported by most commercial indices, but poorly handled by current Internet search engines and WWW-accessible CS document collections. General WWW search engines force author and institutional names to be searched as keywords, usually yielding large numbers of retrieved but irrelevant items. CS document collections, such as the technical report digital libraries, are either not formally cataloged (again, forcing the user to resort to keyword searches with low precision), or have spotty cataloging that makes author and institutional searches a hit-or-miss proposition [12].

Conventional descriptions of information gathering generally assume that users are attempting to gather a comprehensive bibliography, and that their needs are tightly specified in advance [2]. CS researchers, on the other hand, adhere more to the “berrypicking” mode of retrieval: they are generally satisfied with only a few of the many potentially relevant documents, and their information need may be vaguely specified or emerge as they search (“I’ll know what I want when I see it”). Since they tend to work in long term projects, they maintain personal document collections that grow gradually with time, and as short-term goals are achieved. Comprehensive literature surveys, involving a variety of resources, are conducted primarily when entering a field (such as when beginning doctoral work).
4. Implications for information retrieval system design

We are currently investigating a number of interface and collection management issues that are emerging from this study, and exploring ways that these findings can be incorporated into our Internet-accessible “digital library” of computer science technical reports [12]. In particular, we are examining the following interface considerations:

• expected level of bibliographic expertise: CS researchers have little formal training in searching, and are likely to be unaware of finer points of bibliographic information specification and storage. They are also unlikely to approach a trained search intermediary for help, and prefer to perform searches themselves. An information retrieval system for these users should not require that they know how to perform different types of search based on cataloging level (for example, to differentiate between keyword and controlled vocabulary searches), or demand an intimate understanding of the retrieval system’s underlying architecture (for example, to select between ranked or boolean retrieval).

• browsing support: The implementation of browsing by site of document origin (which corresponds to the author’s institution in a majority of cases) is straightforward. Browsing by author requires formal cataloging of documents (and a correspondingly high maintenance overhead). The current version of our digital library supports an approximation of author browsing in our uncataloged system by permitting users to limit searches to the first page of documents; performing a keyword search of the author’s name on the first page will have a high recall for documents written by that person. Precision will be low for common names, however, and the user cannot rely on a standard format for names that will apply across the document collection.
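The first-page approximation of author browsing can be sketched as follows. This is an illustrative sketch only, not the digital library's actual implementation: the page-size cutoff, function names, and sample documents are all assumptions.

```python
# Sketch (not the NZDL implementation): approximate author browsing in an
# uncataloged collection by restricting a keyword search to the first
# "page" of each document, where author names usually appear.

FIRST_PAGE_CHARS = 3000  # assumed rough size of one page of plain text


def first_page(text: str) -> str:
    """Return roughly the first page of a document's full text."""
    return text[:FIRST_PAGE_CHARS].lower()


def author_search(documents: dict[str, str], name: str) -> list[str]:
    """Ids of documents whose first page mentions the given name.

    High recall for true author matches (names appear on title pages),
    but low precision for common names, as noted above.
    """
    name = name.lower()
    return [doc_id for doc_id, text in documents.items()
            if name in first_page(text)]


if __name__ == "__main__":
    docs = {
        "tr-001": "A Study of Indexing\nJ. Smith, University of Waikato\n...",
        # "smith" below appears well past the first page, so it is skipped
        "tr-002": "Machine Learning Notes\nA. Jones\n" + "x" * 5000 + " smith",
    }
    print(author_search(docs, "Smith"))  # -> ["tr-001"]
```

Restricting the match region is what trades recall for precision here: a full-text search for "Smith" would also retrieve every document that merely cites Smith.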
Browsing by subject is more problematic. CS researchers resist learning controlled vocabulary or classification schemes, and classification is time-consuming and expensive as well. Several possibilities are open for economically obtaining rough groupings of documents by subject: automatically clustering documents by term similarities [9]; using machine learning techniques to induce subject categorization rules ([1], [3]); and providing browsing access to existing subject groupings implied by a document’s physical origin (for example, browsing a research lab’s ftp site, or a given electronic journal’s repository of issues). Some research communities are currently building small Internet-accessible collections of articles tightly focussed on their area. These repositories could be indexed as “sub-collections” of a larger CS library, and would provide a natural focus for subject browsing.

• citation searching: Formal citation searching, such as that provided by the Science Citation Index, is extremely resource intensive to support, as the references for each document in the collection must be manually identified and forced into a common format. The NZ Digital Library [12] is unique in that it indexes the full text of documents, including any references. Anecdotal reports from its users indicate that keyword searches by author name have proved useful in citation searching, since documents that reference the author’s works will be retrieved. This type of search has a low precision rate, however. If the reference section of documents in the collection could be automatically identified, then a keyword search limited to that section could provide a much more precise approximation of citation searching.
One promising technique for supporting citation searching is the graphical, asynchronous “Butterfly” interface presented in [6]; this work presents an attractive visualization for citation links, and describes an architecture for conducting link construction semi-autonomously.

• support for “invisible colleges”: Most subjects had found email lists or WWW pages targeted to their research; these resources provided pointers to useful documents, as well as enhancing communication with a group of (sometimes geographically widespread) researchers having similar interests. These community-building communication facilities could be brought under the umbrella of a larger CS digital library, perhaps based on the “sub-collections” concept described above.

• archival facilities: CS researchers rarely archive the files they retrieve, preferring to print documents and discard the file itself. In the short term, the file will continue to be available: "For the most part, I print and throw the file away. I can always get it back...it will always be there tomorrow. There's a certain amount of trust.” In the long term (perhaps even a few months!), however, the file may be moved or even disappear. Since our digital library is a “harvester” collection that stores only a pointer to distributed documents, we cannot assure continued access to documents we index. This situation may come to be accepted by the CS research community, particularly given the very high obsolescence rate of CS documents. If this collection instability is rejected by the community, “harvesting” digital libraries such as our own will have to maintain archival repositories of indexed documents (essentially losing the small storage requirements advantage of a distributed collection).
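The idea of limiting a keyword search to an automatically identified reference section could be prototyped along these lines. This is a hypothetical sketch, not the NZ Digital Library's code; the heading patterns and function names are assumptions, and real technical reports vary widely in layout.

```python
import re

# Hypothetical sketch of citation searching over full text: locate the
# references section of each document, then run the keyword search over
# that section only. Heading patterns here are assumptions.
REF_HEADING = re.compile(r"^\s*(references|bibliography)\b",
                         re.IGNORECASE | re.MULTILINE)


def references_section(text: str) -> str:
    """Return text from the references heading onward ('' if none found)."""
    match = REF_HEADING.search(text)
    return text[match.start():] if match else ""


def citation_search(documents: dict[str, str], author: str) -> list[str]:
    """Ids of documents whose reference list mentions the author."""
    author = author.lower()
    return [doc_id for doc_id, text in documents.items()
            if author in references_section(text).lower()]


if __name__ == "__main__":
    docs = {
        "tr-010": "Intro citing prior work...\nReferences\n[1] Salton, G. ...",
        # mentions Salton only in the body, so it is not a citing document
        "tr-011": "A paper discussing Salton's model, with no reference list.",
    }
    print(citation_search(docs, "Salton"))  # -> ["tr-010"]
```

Compared with a plain full-text author search, this filters out documents that merely discuss an author's work in the body, which is the precision gain the text above anticipates.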
5. Conclusions

This paper presents preliminary results of an examination of the ways that academic computer science researchers in New Zealand search for, use, and store information when conducting research. An understanding of the ways that computer science researchers prefer to locate document collections can provide a useful guide to the design of information retrieval systems for the computing community. We further intend to expand this study to include academics in the United States; it appears likely that the geographic isolation of New Zealand may encourage its academics to rely more heavily on the Internet and electronic communication than is the norm, since conventional means for acquiring documents (by interloan or direct purchase from publishers) are both extremely expensive and slow.
References

[1] Apte, C., Damerau, F., and Weiss, S.M. (1994) “Automated learning of decision rules for text categorisation,” ACM Transactions on Information Systems 12(3), 233-250.
[2] Bates, M.J. (1989) “The design of browsing and berrypicking techniques for the online search interface,” Online Review 13, 407-424.
[3] Cohen, W.W. (1995) “Text categorisation and relational learning,” Proceedings of the 12th International Conference on Machine Learning, Tahoe City, California.
[4] Cronin, B., and Hert, C.A. (1995) “Scholarly foraging and network discovery tools,” Journal of Documentation 51(4), 388-403.
[5] Cunningham, S., and Bocock, D. (1995) “Obsolescence of computing literature,” Scientometrics 34(2), 255-262.
[6] Mackinlay, J.D., Rao, R., and Card, S.K. (1995) “An organic user interface for searching citation links,” Proceedings of CHI ’95.
[7] Peters, T.A. (1993) “The history and development of transaction log analysis,” Library Hi Tech 11(2), 41-66.
[8] Pinelli, T.E. (1991) “The information-seeking habits and practices of engineers,” Science and Technology Libraries 12, 5-16.
[9] Salton, G., and McGill, M.J. (1983) Introduction to modern information retrieval, McGraw-Hill Book Company.
[10] Sievert, D., and Sievert, M.E. (1989) “Philosophical research: report from the field,” Proceedings of the Humanists at Work symposium (April, Chicago, IL, USA). Published by the University of Illinois at Chicago.
[11] Van Heyningen, M. (1994) “The Unified Computer Science Technical Report Index: Lessons in indexing diverse resources,” Proc. Second International WWW Conference, Chicago.
[12] Witten, I.H., Cunningham, S.J., Vallabh, M., and Bell, T.C. (1995) “A New Zealand digital library for computer science research,” Proc. Digital Libraries ’95, Austin, Texas, June, 25-30.
Appendix: selected questions from interviews

• what is your speciality or research focus?
• for your last few papers, where did the research idea come from?
• How do you "do" your research?
  - Organization of work with co-authors
  - Use of the Internet for communication with co-authors
  - Tell the story of the development of the paper
• Do you work with long term projects, or short term focus?
• How do you organize the material that you're using for a paper?
• in your field, how are co-authored papers viewed, as opposed to single author papers?
• do you ever just browse or search looking for new ideas?
• where did you obtain the papers cited in your last few papers? (probe: in library, interlibrary loan, found it on the Net, given a copy by a friend)
• did you search for the papers yourself, or did you have a graduate student or co-author look?
• do you teach graduate students how to search the literature?
• how do you find articles? (probe: browse journals, books in library or other collections, ACM Guide, CompuMath Citation, Computer Abstracts, Computer Reviews, subject cd-roms, paper indices, FirstSearch, general Internet search, tech report servers/libraries, Internet-accessible bibliographies, ftp repositories, email/news inquiries, library catalog, UnCover, Dialog, visiting other libraries, ask a librarian, following references in existing papers)
• when you are building a set of references for a paper, how concerned are you about getting the latest papers on the subject? the most "important" papers? a comprehensive bibliography?
• do you have any preferences for using books, or for using journals?
• how recent should a publication be for you to cite in a paper?
• How did you learn to do literature searches?