Proceedings of the 10th ASIG SIG/CR Classification Research Workshop
Wittgenstein and Indexing Theory

Jack Andersen & Frank Sejer Christensen
Royal School of Library and Information Science, Copenhagen, Denmark
Abstract

The paper considers indexing as an activity which deals with linguistic entities. It rests on the assumption that a theory of indexing should be based on a philosophy of language, because indexing is concerned with the linguistic representation of meaning. The paper consists of four sections: it begins with some basic considerations on the nature of indexing and the requirements of a theory of indexing. This is followed by a short review of the use of Wittgenstein's philosophy in the LIS-literature. Next is an analysis of Wittgenstein's work Philosophical Investigations. Finally, we deduce a theory of indexing from this philosophy. Considering an indexing theory a theory of meaning entails that, for the purpose of retrieval, indexing is a representation of meaning. An indexing theory is therefore concerned with how words are used in their linguistic context. Furthermore, the indexing process is a communicative process containing an interpretative element. Through the philosophy of the later Wittgenstein, it is shown that language and meaning are publicly constituted entities. Since they form the basis of indexing, a theory of indexing must take into account that no single actor can define the meaning of documents. Rather, this is decided by the social, historical and linguistic context in which the document is produced, distributed, and exchanged. Indexing must clarify and reflect these contexts.
Introduction

The philosopher Ludwig Wittgenstein (1889-1951) proposed, in his influential work Philosophical Investigations (1958), a view on language and language use which we think can be related to a theory of indexing. When speaking of indexing and indexing theory within library and information science (LIS), little attention has been or is paid to language and, in particular, the philosophy of language. Language is not considered a problematic entity. Indexing, and in particular the actual document analysis, is very often described in the LIS-literature as something we "just" do (cf. Lancaster, 1998). Language is seen as secondary in relation to indexing and indexing theory, despite the fact that indexing and indexing theory is the examination of linguistic expressions and therefore contains a significant linguistic dimension.
Washington, D.C., 31 October 1999
In the following we will examine this linguistic dimension and discuss indexing and indexing theory in the light of language and the philosophy of language. Concerning indexing theory, we intend to use the view of language of the later Wittgenstein, as expressed in his Philosophical Investigations, as our premise in order to deduce a theory of indexing. We begin with some general considerations of indexing and indexing theory and the concepts of meaning and interpretation. This is followed by a short review of the use of Wittgenstein's philosophy of language in the LIS-literature. After this we will outline some relevant themes in Philosophical Investigations as we think they relate to this paper. Finally, these themes will be related to indexing theory.
General considerations of indexing and indexing theory

Indexing is a process with the purpose of generating a representation of the subject(s) of a document. Indexing can be divided into two steps (cf. Lancaster, 1998; and figure 1): 1) an analysis of the subject(s) of the document; and 2) translation of this/these subject(s) into the particular indexing language or IR-language. An indexing theory has to answer these questions: What is a subject? How is it uncovered? How is it best represented? We will not discuss what a subject is; for a clarification of this concept, we refer to Hjørland (1992; 1997).
Document → Conceptual analysis → Indexing language
Figure 1: Simplified illustration of the two-step model, i.e. the indexing process

As pointed out by Blair (1990), indexing theory can be seen from the point of view of philosophy of language because philosophy of language is, among other things, concerned with words and concepts and their possible relation to reality. Indexing theory is also, or at least should be, concerned with words and concepts. However, this is for the purpose of representing a document or documents most appropriately in order to facilitate retrieval. Therefore, indexing theory contains a considerable pragmatic dimension. The representation of a document says something about both the actual document and the reality or social context it may represent or reflect. We believe therefore that indexing theory can be seen from a philosophy of language point of view, and that is what we will demonstrate in the following. The two-step model mentioned above is characterised by being based on a given document. But a given document cannot influence a theory of indexing.
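Though the paper's argument is conceptual, the two-step model itself can be sketched in code. The following Python fragment is our own illustrative sketch: the function names and the toy controlled vocabulary are assumptions for the purpose of illustration, not part of any actual indexing system or of the model as Lancaster presents it.

```python
# A minimal sketch of the two-step indexing model (cf. figure 1).
# The vocabulary and function names are illustrative assumptions only.

# A toy "indexing language": content words mapped to controlled terms.
CONTROLLED_VOCABULARY = {
    "meaning": "Semantics",
    "language": "Philosophy of language",
    "indexing": "Subject indexing",
}

def conceptual_analysis(document: str) -> list[str]:
    """Step 1: a crude stand-in for subject analysis --
    here simply the content words found in the document."""
    words = [w.strip(".,").lower() for w in document.split()]
    return [w for w in words if w in CONTROLLED_VOCABULARY]

def translate(subjects: list[str]) -> list[str]:
    """Step 2: translate the uncovered subjects into the
    terms of the indexing language."""
    return [CONTROLLED_VOCABULARY[s] for s in subjects]

doc = "Indexing is concerned with the meaning of language."
print(translate(conceptual_analysis(doc)))
# ['Subject indexing', 'Semantics', 'Philosophy of language']
```

The sketch makes the paper's point concrete: step 2 is mechanical once a vocabulary exists, while everything contested lies in step 1, which here is reduced to a placeholder.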
According to our point of view, a theory of indexing is to be found outside the given document because a theory of indexing, among other things, is supposed to formulate some fundamental principles for this two-step model. The actual document is merely what indexing theory is applied to. A theory of indexing therefore has to be found outside the document. That is why the document is not of primary importance. The two-step model is an expression of both the actual indexing process and subsequent indexing practice. Furthermore, this two-step model is rather an introduction to indexing practice than an expression of indexing theory, even though Lancaster (1998) seems to consider it as indexing theory. But, basically, a theory of any kind is not a question of what can be realised in practice but, rather, as Quinn (1994, p. 146) states in relation to classification and indexing theory: "Theory can increase understanding and guide practice" (our italics).

Concerning indexing and the two-step model, Frohmann (1990, p. 82) has stated that "Indexing is generally taken to consist of at least two distinct operations. The first involves either the implicit or explicit representation of a document by an indexing phrase. The second involves the translation of the terms of the indexing phrase into the lexicon of a controlled vocabulary, with due regard for the semantics and syntax of the indexing language. Most research focuses on the second step, while the first continues to be lamented as an intellectual operation both fundamental to indexing yet so far resistant to analysis." In the LIS-literature similar statements have been made by, for instance, Hutchins (1978), Langridge (1989) and Chu & O'Brien (1993). Despite the recognition of the importance of this first step, the development of an explicit theory of indexing within LIS is, in our view, still lacking.
Research within LIS on classification and indexing has resulted in a lot of knowledge about the actual translation process, i.e. the second step in the model, the construction of indexing languages, classification systems etc. But we believe that a theory to explain and create a better understanding of this two-step model does not exist within LIS. In the LIS-literature not much attention is paid to this lack of an indexing theory. As Quinn (1994, p. 141) quite rightly states: "The lack of an indexing theory to explain the indexing process is a major blind spot in classification." Regarding indexing, Frohmann (1990, p. 81) has also remarked that what is peculiar to LIS as a science is to represent documents conceptually for the purpose of retrieval. It is, therefore, incomprehensible that LIS has not been able to formulate and explicate a theory of indexing. Consequently, we believe it is the first step in the two-step model which has to serve as the foundation for a theory of indexing, and that is what we will concentrate on in the following.
When actually speaking of theories of indexing within LIS, the fundamental problem is, paradoxically, that they are non-theory based. Theories of automatic indexing and classification serve as a classic example of this. As far as we are concerned, no formulated principles of what it means to index, and no thorough understanding of indexing, have been developed within LIS. In our opinion, Wilson (1968) and Lancaster (1998, pp. 1-18) are the closest LIS comes to an explicated theory of indexing. By using the philosophy of language of the later Wittgenstein as it is expressed in his Philosophical Investigations, we believe we can provide a contribution to a theory of indexing.
Meaning and Interpretation

Føllesdal et al. (1992, p. 169) connect language with a communication process. This means that, when dealing with language and philosophy of language in relation to indexing theory, an indexing process can be seen as a communication process. Hence, a theory of indexing contains an essential aspect of the philosophy of language. This is also suggested by Blair (1990, p. 122): "...The process of representing documents for retrieval is fundamentally a linguistic process....Thus any theory of indexing or document representation presupposes a theory of language and meaning." (our italics)

Speaking of the relationship between language and communication, Føllesdal et al. (1992, p. 169) add that language is necessary in the communication process because we cannot transfer information directly. When speaking of indexing, we do not transfer information directly either. Rather, information is transferred by the use of document representations. These document representations are constituted by language. Hence, communication of information happens through language. Føllesdal et al. (1992, p. 176) further state, when speaking of meaning, that the meaning of a linguistic expression is what it expresses; that is what we know when we understand it, and thereby meaning is the basis of all communication. This means, when viewing indexing as a form of communication, that a theory of indexing must be a theory of meaning. Since meaning is the basis of all communication, communication without meaning is not possible; that is, the communicative act cannot take place. When the later Wittgenstein (1958, § 43) asserts that meaning is use, it should be understood in the sense that meaning is something we create in order to gain understanding.
In order to communicate at all, human beings, conceived of as social actors, need to have a certain degree of understanding of the situation (the language game) within which something is being communicated. Therefore, in the language game we have created a meaning which
ensures this degree of understanding. That is why meaning is use. And that is why it is not important whether what is being said is true or false but, rather, whether it is understood or misunderstood. Thus, to say that meaning is use and that meaning is the basis of communication does not reflect a contradiction. Assuming that meaning is the basis of communication, the communication process is in this connection not identical to Shannon & Weaver's (1949) conception of the communication process (1). Shannon & Weaver are concerned with transmission of intended meaning. Intended meaning means that meaning is built into the transmitted message and, with that, given in advance. Fiske (1990, pp. 2-3) calls the approach of Shannon & Weaver "the process school". This approach to the study of communication sees the communication process as consisting of an encoder who has an intention with the message being sent, a channel through which the message is sent, and a decoder/receiver. The intention is a change of what the cognitive viewpoint within LIS would call the knowledge structure of the user, i.e. the receiver (cf. Ingwersen, 1992). This conception of meaning is very similar to the information concept of the cognitive viewpoint (Ingwersen, 1992, p. 33). However, Fiske (1990, pp. 2-3) speaks of another approach to the study of communication, "the semiotic approach", which views the communication process as the production and exchange of meaning. The study of communication in this approach is the study of texts and culture. That is, "the semiotic approach" focuses on how texts interact with people in order to produce meaning. Those involved in a communication process all influence each other, and the meaning received is not necessarily the one intended. Rather, meaning is something which is created and exchanged by those involved.
This semiotic approach and its conception of meaning is very similar to the theory of meaning the later Wittgenstein developed in his Philosophical Investigations. Here, the meaning of a word is its use in language (Wittgenstein, 1958, § 43). This circumstance, that the meaning of a word is its use in language, is our premise when we in the following discuss indexing theory; that is, how do we index documents when the meaning of words arises in use or, in the terminology of Wittgenstein, in the language game? With this we claim that considering an indexing theory a theory of meaning entails that, for the purpose of retrieval, indexing is a representation of meaning. Therefore, a theory of indexing is basically and necessarily a question of how words are used in the linguistic contexts (that is, the language games) they are a part of; consequently, a question of instrumentality or functionality. This is to say that language use plays a determining role and, when speaking of indexing theory, that meaning cannot be presupposed but is determined in use. As Blair (1990, p. 145) points out with the words of Wittgenstein: "...we don't start from certain words, but from certain occasions or activities." On communication and, for the matter of this paper, indexing theory, we can further state that it contains an interpretative aspect. Since language is a means of communicating meaning, we need to interpret the meaningful expressions in order to act upon them; that is, to understand the meaning "lying behind" these meaningful
expressions. From this it follows that meaning and interpretation are central concepts in a communication process and, therefore, in an indexing process. Within LIS, researchers have treated modern hermeneutics in various ways in relation to theoretical LIS problems. For instance, Benediktsson (1989) and Hoel (1992) both discuss the possibility of LIS taking its theoretical foundation in modern hermeneutics. Ingwersen (1992) points out the overlaps between the cognitive viewpoint and the hermeneutics of Heidegger and Gadamer. Ingwersen pays particular attention to Gadamer's concepts of pre-understanding and horizons of understanding and the cognitive viewpoint's concepts of individual knowledge structures and world models (Ingwersen, 1992, p. 42), concepts which the cognitive viewpoint makes use of in IR interaction. Christensen (1994, p. 38), in her attempt to explore hermeneutic ideas and concepts and their possible relevance to LIS, ascertains with regard to indexing, but without speaking about indexing theory, that central concepts from hermeneutics such as understanding, interpretation, communication and language are of great significance for the subject analysis and indexing process and for the representations that result from this process. Through his "Interpretive Viewpoint", Cornelius (1996a, pp. 11-22) uses modern hermeneutics in his theoretical considerations of the relationship between information and interpretation. Further, Cornelius (1996b) uses modern hermeneutics in an attempt to deliver a foundation for LIS theory and practice. Hjørland (1997, p. 34), however, states that the relationship between hermeneutics and LIS has never been discussed in order to relate modern hermeneutic ideas (in particular, Hjørland mentions the ideas of Gadamer) to subject data, and thereby indexing. As we intend to show later, there are certain similarities between the philosophy of the later Wittgenstein and parts of hermeneutics when speaking of indexing theory.
Wittgenstein and Library and Information Science

In general, few works in the LIS-literature are based on the philosophy of language, and few take the later Wittgenstein's philosophy of language as their basis when speaking of indexing theory. In the following we will give a brief review of those within LIS who have dealt with Wittgenstein in their works. This review is, of course, not at all exhaustive (2). In addition to McLachlan (1981) and Nedobity (1989), Hjørland (1998) points out that a growing interest in Wittgenstein can be traced within LIS in the 1990's. Examples are Blair (1990, 1992), Frohmann (1990), Warner (1990), Brier (1996a, 1996b), Karamüftüoglu (1996, 1997, 1998), Talja (1997), Tuominen (1997) and lately Hjørland (1998) and Mai (1998). Warner (1990) is the only one using the
early Wittgenstein, while Nedobity (1989) and Hjørland (1998) use both the early and the later Wittgenstein. The others rest on the later Wittgenstein. In order to discuss the classification theory put forward by Buchanan in his work "Theory of Library Classification" (1979), McLachlan (1981) makes use of the later Wittgenstein's philosophical considerations. According to McLachlan, Buchanan's view of classification is that it can be seen as a mental activity, and McLachlan supports this view. With that, McLachlan dissociates himself from the ideas of the later Wittgenstein, because the later Wittgenstein would claim that it is through the language game we have access to concepts. In this connection McLachlan adheres to a theory of John Locke which McLachlan himself names "linguistic mentalism" (McLachlan, 1981, p. 191). According to McLachlan, Buchanan claims that members of a class must necessarily have something in common, but McLachlan finds Buchanan's assertion untenable, referring to the later Wittgenstein's theory of family resemblances (McLachlan, 1981, p. 195). As mentioned before, Nedobity's (1989) point of departure is both the early and the later Wittgenstein. Nedobity tries to evaluate the different methods used in investigating the meaning of a term. In doing this, the theories of meaning from Wittgenstein's early work Tractatus Logico-Philosophicus and from Philosophical Investigations are used together with Eugen Wüster's investigations of the relationship between concepts and their representation. Warner (1990) uses only the early Wittgenstein. In order to theorise about the relationship between documents and computers, Warner uses the logical considerations and principles presented in Tractatus Logico-Philosophicus. The inner language of the computer is illustrated by these logical considerations and principles.
But because Warner ascertains that the relationship between LIS and semiotics is as yet unexplored, Warner's primary aim is to formulate a fundamental principle common to both documents and computers. In his semiotic analysis of IR-systems and IR-interaction, Karamüftüoglu (1996, 1997, 1998) makes use of the theory of language games developed by the later Wittgenstein. The theory of language games is used to explain that the significant distinction in IR-interaction is a distinction between two specific forms of language games: the language game of "denotation" and the language game of "prescription" (Karamüftüoglu, 1996, p. 85). With an epistemological and methodological basis, Talja (1997) presents the "discourse analytic viewpoint", as developed by Michel Foucault, as an alternative to the cognitive viewpoint within LIS. Talja (1997, p. 70) uses the later Wittgenstein's private language argument to reject the cognitive viewpoint's conception of concepts and categories as mental representations. By the use of discourse analysis, Tuominen (1997) examines Carol C. Kuhlthau's book "Seeking Meaning: A Process Approach to Library and Information Services"
(1993). Tuominen holds that Kuhlthau's book does not offer a critical reflection on how researchers within LIS actually construct users. With Kuhlthau's book as a reference, Tuominen believes that within LIS there is a tendency towards constructing the librarian as a physician and the user as a patient. With regard to information needs, Tuominen does not believe that they are an expression of something mental which it is up to the librarian or information specialist to diagnose. Here, Tuominen refers to the later Wittgenstein's private language argument and deduces that the user's information need is always a social construct rather than an individual construct (Tuominen, 1997, p. 361). Brier (1996a, 1996b) argues for a theoretical foundation for LIS based on the later Wittgenstein's theory of language games, the semiotics of Peirce and second-order cybernetics. However, Brier's discussion of the later Wittgenstein is limited to a review of Blair's (1990) reading of the later Wittgenstein. Mai (1998, p. 231) argues that LIS in general, and the organisation of knowledge in particular, needs an epistemological foundation. Starting from the epistemological view the later Wittgenstein presents in his "On Certainty", Mai (1998, p. 240) points out that the organisation and representation of knowledge takes place within a certain social practice, and that this practice is the crucial factor in the organisation and representation of knowledge. Hjørland (1998) puts forward that IR-theories must relate to theories of meaning. Specifically, Hjørland discusses the consequences for the different subject access points in databases, and for IR as a whole, if the theories of meaning presented in Tractatus Logico-Philosophicus and Philosophical Investigations, respectively, are taken as the point of departure. As shown in this brief review, the use of Wittgenstein within LIS with regard to indexing theory is minimal.
Therefore, we believe that Blair (1990, 1992) and Frohmann (1990) are the only ones who specifically deal with Wittgenstein’s philosophy of language in order to discuss and analyse problems in indexing theory. Blair (1990, pp. vii-viii), for instance, holds that “The central task of Information Retrieval research is to understand how documents should be represented for effective retrieval. This is primarily a problem of language and meaning. Any theory of document representation, and, by consequence, any theory of Information Retrieval must be based on a clear theory of language and meaning.” (our italics)
So, Blair believes that indexing theory should take the philosophy of language as its starting point, that is, the philosophy of language of Wittgenstein in particular, but also that of John L. Austin and John R. Searle. Frohmann (1990) criticises what he calls mentalism in IR. Frohmann thinks that much attention has been paid to discovering mental rules for how to deduce index terms. Drawing on the rule-following considerations put forward by Wittgenstein in Philosophical Investigations, Frohmann believes, regarding indexing theory, that we ought to move from "rule discovery" to "rule construction". We consider our paper a continuation of Blair's and Frohmann's very important contributions to our understanding of indexing theory within LIS.
Conception of language in Philosophical Investigations

We will now take a closer look at the conception of language in Wittgenstein's Philosophical Investigations. We have chosen what we consider four relevant and central themes in Philosophical Investigations: language games, family resemblances, rule-following and the private language argument. In Wittgenstein's early work, Tractatus Logico-Philosophicus (1961), it was the sentence which was the fundamental meaning-bearing and mediating entity. In Philosophical Investigations it is the language game. In Philosophical Investigations a varied conception of language is established. Language is seen as something constituted by the presence of language games. The question of what is being said is not whether it is true or false but, rather, whether it is understood or misunderstood.

Language games

We consider the concept of language games the central point of Philosophical Investigations. Hence, it also serves as a main concept for the three other themes. Wittgenstein opens Philosophical Investigations by citing a conception of language presented by St. Augustine. St. Augustine has learned language by observing how adults speak. With this St. Augustine has learned to understand which objects the various words designate (Wittgenstein, 1958, § 1). According to Wittgenstein, St. Augustine hereby believes that he has learned language; i.e. when you have learned what all words in language designate, then you have learned language. Wittgenstein believes this conception of language presents a particular picture of what he calls the essence of human language:
"It is this: the individual words in language name objects - sentences are combinations of such names. - In this picture of language we find the roots of the following idea: Every word has a meaning. This meaning is correlated with the word. It is the object for which the word stands." (Wittgenstein, 1958, § 1)

Wittgenstein now introduces the concept of language games. According to Wittgenstein, the way we learn language does not consist in a game of inventing names (which is, however, also a kind of language game, although a primitive one); that is, the view that we take an object and point at it in order to teach somebody the name of that object - in other words, the use of ostensive definitions. To this Wittgenstein would reply that we have learned nothing about the object because we do not know what it is we are pointing at. The form of the object? Its colour? And what is colour? What does pointing mean? etc. However, Wittgenstein does not exclude ostensive definitions as a means of learning language, but points out that they imply some knowledge of language and of the language game they are involved in. If one merely can say the name of the object, one cannot claim to have learned the language game, because one has not learned to use the word and, with that, according to the later Wittgenstein, what it means. Hereby, Wittgenstein contests the theory of naming he put forward in Tractatus Logico-Philosophicus. There, words are the names of objects and the name of the object is its meaning. That is, a conception of language as something which is constituted of names, and something for which conditions of truth can be put forward - for example "Here is a house and it is green", unlike "Ouch!". According to the later Wittgenstein, a meaningful sentence is not necessarily a picture of a state of affairs.
Wittgenstein defines the concept of language games thus:

"Here the term 'language game' is meant to bring into prominence the fact that the speaking of language is part of an activity, or of a form of life." (Wittgenstein, 1958, § 23)

To speak a language is in itself a social action, a collective undertaking, an activity, etc., and the individual is nothing but a language user. This means that by having a language, you have the other and vice versa. Language is basically a social phenomenon. Every linguistic activity is connected to the usage of rules. Being part of a language game, it is expected that these rules are followed. In this sense rules are an expression of how words are used in the particular language game, because:

"For a large class of cases - though not for all - in which we employ the word "meaning" it can be defined thus: the meaning of a word is its use in language." (Wittgenstein, 1958, § 43)
The meaning of words, then, is not what they designate but what they can be used for. With that, Wittgenstein has proposed a new theory of meaning: meaning is something human beings create as social actors within the framework of a particular language game in order to communicate. This is the opposite of considering meaning as something which is given a priori. This is not to say that language is a totally subjective entity but, rather, that language has an intersubjective character:

""So you are saying that human agreement decides what is true and false?" - It is what human beings say that is true and false; and they agree in the language they use. That is not agreement in opinions but in form of life." (Wittgenstein, 1958, § 241)

With that, language, language games and forms of life are inseparable entities. Rules in one particular language game are not necessarily rules in other language games, because of the diversity of language games. This implies that the usage of words is equally diverse. Wittgenstein compares the diverse applications of a word with the tools in a toolbox:

"Think of the tools in a tool-box: there is a hammer, pliers, a saw, a screw-driver, a rule, a glue-pot, glue, nails and screws. - The functions of words are as diverse as the functions of these objects." (Wittgenstein, 1958, § 11)

The individual language games are almost incommensurable (Wittgenstein, 1958, § 65) and, therefore, the legitimacy, validity or truth of language has to be found in its functionality or appropriateness. Truth is an essential part of a language game because criteria of truth consist in and arise from language use. In order for a language game to function, a minimum level of consensus within the particular language game concerning the usage of words is necessary - or simply in order to speak a language. Further, this implies that a language game has no limits, but limits are drawn with regard to a purpose (Wittgenstein, 1958, § 69). The act of a language game is not a goal in itself.
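The point that usage within a language game settles meaning can be given a crude computational illustration. The sketch below is entirely our own construction: the "language games" are hypothetical, modelled naively as sets of co-occurring words. It shows the consequence for indexing that the paper argues for, namely that the same word receives different index terms in different contexts.

```python
# A toy illustration (our own construction) of the claim that the
# meaning of a word is settled by the language game it occurs in:
# the same word receives different index terms in different contexts.

# Hypothetical "language games", modelled crudely as sets of
# co-occurring words. Real language games are, of course, practices,
# not word sets; this is only a sketch.
LANGUAGE_GAMES = {
    "finance": {"money", "account", "loan", "interest"},
    "geography": {"river", "shore", "water", "erosion"},
}

def index_term(word: str, context: set[str]) -> str:
    """Choose an index term for `word` according to the language
    game whose vocabulary overlaps most with the word's context."""
    game = max(LANGUAGE_GAMES, key=lambda g: len(LANGUAGE_GAMES[g] & context))
    return f"{word} ({game})"

print(index_term("bank", {"loan", "interest", "account"}))  # bank (finance)
print(index_term("bank", {"river", "water", "erosion"}))    # bank (geography)
```

No property of the word "bank" in isolation decides its index term; only its surroundings - its use - do, which is precisely the paper's point against a priori meaning.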
The language game is a sort of empirical study of language use; i.e. the language game serves as a means to describe and explain the possible meaning of the words in the actual language game:

"Our clear and simple language-games are not preparatory studies for a future regularization of language - as it were first approximations, ignoring friction and air-resistance. The language-games are rather set up as objects of comparison which are meant
to throw light on the facts of our language by way not only of similarities, but also of dissimilarities." (Wittgenstein, 1958, § 130)

In line with this statement concerning the language game as a means to describe and explain the possible meaning of the words in the actual language game, it follows that when the language games change, the words change, and thereby the meaning of words. It is not, then, the words that change the language games. This is a prerequisite for understanding the nature and instrumentality of the individual language games. It entails that access to the meaning of words happens through the language game, because the individual is nothing but a language user.

Family Resemblances

In Tractatus Logico-Philosophicus Wittgenstein defined language as a picture of the world (i.e. the picture theory of meaning). In Philosophical Investigations, language is the name of an indefinite class of language games. The number of language games is indefinite because new language games can arise, and the boundary of what can be considered a language game is difficult to draw. Due to the diversity of the language games, no absolute recipe exists for how language is defined. Language has no determining property, and the class of language games has nothing in common as a whole (Wittgenstein, 1958, §§ 65-66). Language is made up of various language games which both resemble and differ from each other. A word is not an absolute entity, which means that it does not have one particular usage and thereby one particular meaning. Words do not refer to one general meaning which refers to reality. The consequence of this is that a given word does not in itself have one clear, unambiguous meaning. This means that we are not able to deliver an adequate account of the characteristics something ought to have in order to fall under that particular word. As an explanation of this, Wittgenstein speaks of family resemblances (Wittgenstein, 1958, § 67).
With the concept of family resemblances, Wittgenstein points out that just because language is the name of an indefinite class of language games, one cannot assume that the language games express something “common”; for what should that be? Wittgenstein illustrates this with the example of “games”, asking what the common denominator is for games like card games, ball games, board games, etc. He argues that just because these games fall under the category “games”, it does not follow that they necessarily have something in common. By investigating these games, Wittgenstein holds, one will find not something common to all of them, but rather resemblances and the like: “To repeat: don’t think, but look!” (Wittgenstein, 1958, § 66). When the class of language games has no common denominator but is constituted by family resemblances, this also implies that one cannot state complete and unambiguous rules for the correct application of a given word. Nevertheless, we
Washington, D.C., 31 October 1999
Andersen & Christensen
are able to use the word properly intersubjectively, because it is the particular language game which determines the usage and thereby the meaning. To use a word as a name, then, is just one out of many possible applications (Wittgenstein, 1958, § 27). With this Wittgenstein breaks with the conception in Tractatus Logico-Philosophicus that words are meaningful if and only if they are names.

Rule Following
With the introduction of the concepts of language games and family resemblances, Wittgenstein has argued for two conceptions of language. First, the conception that the meaning of words is in many cases not what they designate but arises in use. Second, words do not have one clear meaning; we are not able to give clear and exhaustive guidelines for the correct application of a word, since a word can be used in many diverse contexts. In relation to this Wittgenstein points out that there is a strong relationship between understanding and using a word on the one hand, and understanding and following a rule on the other. He points this out because “The application is still a criterion of understanding.” (Wittgenstein, 1958, § 146). Understanding will lead to correct application. The concept of “application” is what connects understanding and rule following, because we apply a word and apply a rule, both with regard to action. The application of a word and of a rule thus happens with regard to a purpose. Hence the connection between understanding and application of a word on the one side and understanding and rule following on the other. According to Wittgenstein, to follow a rule is a social practice; in other words, following a rule cannot be something private or individual, because “To obey a rule, to make a report, to give an order, to play a game of chess, are customs (uses, institutions).” (Wittgenstein, 1958, § 199) and further “And hence also ‘obeying a rule’ is a practice. And to think one is obeying a rule is not to obey a rule.
Hence it is not possible to obey a rule ‘privately’: otherwise thinking one was obeying a rule would be the same thing as obeying it.” (Wittgenstein, 1958, § 202) Following a rule is determined by the fact that there is a world, a practice; it is not an inner intention which determines this practice. Following a rule and understanding a word are not mental conditions, because formulating a rule does not in itself explain how the rule is supposed to be followed. Likewise, a formulation of a word does not explain how it is supposed to be applied and understood, because
“...what confuses us is the uniform appearance of words when we hear them spoken or meet them in script and print. For their application is not presented to us so clearly.” (Wittgenstein, 1958, § 11) and because “A main source of our failure to understand is that we do not command a clear view of the use of our words.” (Wittgenstein, 1958, § 122) Since following a rule and applying a word are not identical with some other mental state, there must be some, if not objective, then at least intersubjective criteria for when a person has understood a word. These intersubjective criteria are to be found in the language game and are defined by the practice of the particular language game. This means, for instance, that if every time we point at a house we say “car!”, it is not a meaningless statement. It just means that in that particular language game we have agreed that when we point at a house we say “car”. Besides exemplifying intersubjectivity, this example also shows that the meaning of the statement “car” is its use in language. Since there is a connection between understanding and applying a word and understanding and following a rule, it follows that the understanding of a word likewise must be defined by the social practice, and is thereby not something mental. If the understanding of a word were something mental, we would run into problems in determining whether a given word is understood correctly. What should these criteria consist in, and which mental state should be identified with the understanding? Wittgenstein’s answer is that the single language user cannot determine whether s/he uses a word correctly: “For even supposing I had found something that happened in all those cases of understanding, – why should it be the understanding? And how can the process of understanding have been hidden, when I said “Now I understand” because I understood?! And if I say it is hidden – then how do I know what I have to look for?
I am in a muddle.” (Wittgenstein, 1958, § 153)

Private Language Argument
Wittgenstein’s distinction between the inner/private and the outer/public is further emphasised in his private language argument. Since the understanding of an expression and rule following are not expressions of something private or mental, Wittgenstein does not think that a private language can be maintained (Wittgenstein, 1958, § 243 & § 269). A private language is a language in which the mediation of
private experiences takes place, and a language which no one other than the individual can understand and speak: “The individual words of this language are to refer to what can only be known to the person speaking; to his immediate private sensations. So another person cannot understand the language.” (Wittgenstein, 1958, § 243) Wittgenstein’s denial of the conception of a private language should be seen in continuation of his conception of language as a fundamentally social phenomenon. Language is not learned through purely subjective experiences. If a conception of a private language is nonetheless maintained, Wittgenstein shows that it leads to absurdities. If one speaks of a private linguistic practice, that is, a practice in which only the individual can express himself, then it does not make sense to talk about “correct”, or for that matter “incorrect”, use. This becomes apparent in the application of a given word. Here, the individual cannot give an understandable intersubjective definition, because the individual is speaking a private language. Furthermore, the individual is not able to state public criteria for the correct application of a word. What should such public criteria consist in, when it is a private linguistic practice that is being maintained? What is supposed to be consulted in the private language in order to reach a decision about correct application? The individual language user cannot determine whether s/he is using the word correctly. This implies that persons each having a private language could not communicate with each other, unless communication directly between minds were possible, which is in itself an absurdity when speaking of intersubjective linguistic communication. With the rejection of the possibility of a private language, Wittgenstein at the same time rejects solipsism. Solipsism assumes that knowledge of reality must take its point of departure in subjective experiences.
What can be known with certainty is what the subject can apprehend immediately through its private senses. The subject, then, has direct and unmediated access to the phenomena.
Philosophical Investigations and indexing theory
In the following we will outline the principles of an indexing theory based on Philosophical Investigations. We will proceed in the same manner as above, taking each of the philosophy’s four themes in turn and considering their consequences for indexing.

Language games and indexing theory
Assuming that any given document is part of a linguistic practice, a language game, then the document in and by itself cannot be the basis of indexing, let alone of a theory thereof. Rather, indexing must be based on the context and the language game it is a
part of. Indexing is thus not merely a matter of summarising the document's content conceptually, i.e. producing a document surrogate (cf. Hutchins, 1978). This forms only a part of the document representation. If the representation is to be meaningful, it must also reflect the language game that the document is a part of. A document is produced by people partaking in a certain language game, which it reflects, but it can also be considered a single statement in a slow game consisting of documents. Knowing the document is not in itself sufficient to understand its meaning; we also have to know the language, which implies knowing how to use it. The document representation must conceptually reflect the language game. The subject analysis of the specific document must result in indexing terms which reflect its linguistic context. The scope of indexing theory must be broadened to include a dimension concerned with the sociology of language and knowledge, in order to explain the linguistic and epistemological functions of a document in the particular language game. The information gained through these considerations is vital for the proper indexing, and subsequent retrieval, of any document. Concerning knowledge in the language game, Blair (1990, p. 148) states with the words of Wittgenstein: "...the knowledge of language games is a 'knowing how' rather than a 'knowing that'". It is in the light of this that the sociolinguistic dimension must contribute to clarifying the language game of the document. An analysis of the language game is necessary to clarify its conceptual structure, thereby establishing the correct use of the words within it. In other words, it will reflect the rules of the particular parent game of the document at the time of analysis. The meaning of the words is clear when their proper use is established. The analysis of the language game is based on an empirical study of language use.
The conceptual analysis of a document cannot be carried out properly unless the indexer has a good understanding of the language game. The meaning and purpose of a document is not a property inherent to it. Rather, its linguistic and conceptual meaning is determined by external factors, within the framework of the language game. If we are to carry out an analysis of a given document, we need to understand it in its context. Indexers are, in other words, in the business of sense-making. Føllesdal et al. (1992, p. 85) point out that the prerequisite of a theory of meaning is a definition of what constitutes meaning and how it is transferred through the use of signs. An indexing theory based on the notion of language games cannot consider a document’s content an objective linguistic entity. A document does not define itself. This notion is also maintained by the philosophical hermeneutics of Hans-Georg Gadamer, who states that there can be no complete and final interpretation of a text (Lübcke, 1996, p. 175). According to this philosophy, it is meaningless to talk of the text as an entity with an independent meaning. A text should not be seen as more than an object of interpretation, and the sum of its interpretations makes up its meaning(s). There might, at a given time, be one correct interpretation, but every new generation must relate to the text from its particular points of view. Consequently, a document cannot have a universally recognised, finite subject matter, such as the "forms of knowledge" advocated by Langridge (1989, p. 69).
As Føllesdal et al. (1992, p. 92-93) note, a text signifies what the established use of words and concepts allows, and this will change over time. A subject analysis of a document is meaningless without due regard for the language game from which it originates.

Family resemblances and indexing theory
With the concept of family resemblances, Wittgenstein established that language has no defining properties. Rather, it consists of a number of intertwined language games, which may share some characteristics. Words cannot be defined unambiguously, since their use may, and will, differ with the game. The possible uses to which a word can be put at a given time depend on the language games, and so the range of its meanings depends on the differences among the games. The meaning of a word can thus not be given a priori but, since meaning is use, only a posteriori. The consequence for an indexing theory is that we cannot have a preconceived notion of the use and meaning of words and concepts. An indexing theory cannot consider these independently of their use. As Wittgenstein points out (1958, § 66), we have to go out and look. Regarding subject descriptions Blair (1990, p. 157) writes: "They cannot be defined either by reference to expected behaviours, or even to act in certain way... They are, in short, simply words which are used in a particular way in certain kinds of situations." (our italics). This by no means excludes that we may delimit, or more explicitly formulate, the use of a word in a particular context. We believe that this is exactly the purpose of indexing theory regarding family resemblance: we have to clarify the relationships between the various language games and the words and concepts we want to understand, in order to realise the potential uses to which the latter can be put.
Rule following and indexing theory
With his concept of rule following, Wittgenstein stresses the close relationship between understanding and using a word and the ability to follow a rule. They are connected by virtue of being social phenomena rather than mental properties. Expressions may be used and rules may be followed, but neither is self-explanatory in the sense that the proper use is self-evident. The correct use of both is defined by, and takes place in, a social practice. Within any one social practice, consisting of the forms of life and their language game, there is agreement concerning the use. If the concept of rule following is applied to indexing theory, it is immediately apparent that indexing is a social practice, constituted by rules for the assignment of indexing terms. An indexing theory must attempt to create the foundation that will allow a coherent and explicit explanation of the social practice of term assignment, which, in the words of Frohmann (1990, p. 97), “...depends, therefore, upon a preliminary understanding of the social practices constituting text retrieval in the actual, historically real social world.”. In other words, an explication of the rules of indexing.
This is necessary because a theory is meaningless if it cannot be stated clearly or communicated. This is where the connection between understanding and the following of a rule is of importance, since, as Wittgenstein writes: “Is what we call “obeying a rule” something that it would be possible for only one man to do, and to do only once in his life? – This is of course a note on the grammar of the expression “to obey a rule”. It is not possible that there should have been only one occasion on which someone obeyed a rule. It is not possible that there should have been only one occasion on which a report was made, an order given or understood; and so on... To understand a sentence means to understand a language. To understand a language means to be master of a technique.” (Wittgenstein, 1958, § 199) If we are unable to state the indexing theory explicitly, we cannot justify the rules that govern the assignment of indexing terms. If we cannot explicate these rules, they are not rules at all; how can we communicate these rules so that others may understand and use them, if we are unable to state them? If we are unable to put the rules into words, we cannot claim to have understood them, and following rules which we haven't comprehended won't do. Considering the pragmatic dimension of indexing, a theory must also include an awareness of which social practices, i.e. language games, the document might potentially be of use to. The indexing rules which have to be created within LIS should, with Wittgenstein’s principle of rule following in mind, be firmly based on an indexing theory. Ideally, this theory should be the result of the general knowledge and insight of LIS. Consequently, LIS must acquire a thorough knowledge of language games: an understanding of why and how the need for knowledge and information arises within the particular language game, in its social, cultural and historical context.
This knowledge is necessary because we index documents so that they may, at a later date, satisfy such information needs. But an information need cannot be understood unless the language game and its socio-cultural context are taken into consideration, because they are the prerequisite for sense-making. A statement is meaningless unless viewed against its linguistic and social backdrop. Consequently, the understanding of these contexts is vital to a theory of indexing, since it is basically a theory of meaning.

Private language and indexing theory
First of all, given the private language argument, theories of automatic indexing must logically be rejected. These kinds of indexing theories take the actual document, and not the social context which constitutes the document, as their point of departure. In other words, in theories of automatic indexing meaning is assumed to be inherent in the individual document. But when speaking of private language in
relation to indexing theory, it is not a question of how an individual (e.g. author/document, indexer or user) uses a word or concept, but of how that word or concept is used in the particular language game. The consequence of the discussion above is that no individual, regardless of whether it is the author/document, the indexer or the user, can be the sole cognitive authority on matters of indexing or its theory. The cognitive authority is the yardstick against which the meaning of words and concepts is measured. If an individual were the cognitive authority in indexing-theoretical and linguistic matters, the result would be a private language. But, according to the later Wittgenstein, a private language is impossible because it cannot tell us anything about the proper use of a word. Since the use of a word is determined through social interaction, no one person can define it arbitrarily. Consequently, an indexing theory cannot be based on the interpretation of the individual. Indexing and indexing theory are often based on the idea of an individual cognitive authority, very often the author. An example of this is Lancaster (1998, p. 22), who states that: "It is the ideas dealt with by the author, rather than the words used, that must be indexed.". Thereby it is implied that indexing should reflect the author’s intent. However, writing a document is a social act (cf. Bazerman, 1988, p. 10); it is a statement made through a particular medium, and so the ideas dealt with by the author are not for the author to decide. Since private language is an impossibility, the cognitive authority of the indexing theory must be the language game. Because we consider a theory of indexing a theory of language use, which is a public phenomenon, it cannot rest on a mentalistic/psychological foundation. It is exactly the public use of language that constitutes the language games, in which the meaning of words blossoms and makes itself accessible.
If we accept that the language game is the cognitive authority of the indexing theory, the question arises: whose language game? The author/document’s? The users’? The indexer’s? We shall answer through exclusion. The indexer should be an intermediary whose job it is to help the document and the user meet; the language game of the indexer is irrelevant to this function. The language games of the users are too heterogeneous and dynamic to be a feasible foundation for indexing. So the last one standing is the author. It is the parent language game of the author and the document which is the cognitive authority, under the assumption that they originate from the same game. We rest this on two points. First, we assume that the primary objective of any information system is to proactively mediate documents, i.e. the language game of the author/document. This is done, in part, through the indexing. Making the information system proactive is not the same as giving the users what they want; indeed, we consider the two mutually exclusive. Accommodating the users may compromise the system by simply confirming their particular, subjective notions. Thereby the information system would in effect be trying to cater to the private languages of the users. Furthermore, there is no need to mediate what the users already know. The role of the information system must therefore be to promote the documents by reflecting their language games. We interpret this to be in line with Mai (1998, p. 240) when he points out that the
organisation and representation of knowledge takes place within a certain social practice, and that this practice is the crucial factor in the organisation and representation of knowledge. While the language game of the author/document is irrelevant to the actual use to which the document is put, it is of major importance in indexing theory. The social practice concerned with the representation of the document cannot take into account the particular uses to which the document might later be put, i.e. the language games it might be used in. In an indexing theory we therefore need to understand the language game of a given document in order to predict its potential uses. For example, different theoretical viewpoints present in a document may themselves serve as potentialities which can be of use in other social contexts, i.e. other language games. However, we need to understand what has fostered these theoretical viewpoints in order to predict the potential uses of a document; that is, we need an understanding of the social contexts in which these theoretical viewpoints are born. To try to predict the potential uses of a certain document, i.e. its possible applications in other social contexts, is also to say something about the social context which fostered the document in question. We cannot predict a usage of a document out of the blue; there must be a ground which gives rise to the potential uses, and that is the social context of the document to be indexed. We believe that the particular language game of a given document is an expression of what constitutes the document, that is, an expression of the conditions under which meaning is produced, distributed and exchanged, and this is exactly what makes indexing theory a theory of meaning. We can illustrate this with an elaboration of figure 1:

Social context (language game) → Document → Conceptual analysis → Indexing language

Figure 2: Elaborated indexing model and process

With this elaboration we try to underscore the importance of the language game of the document in the indexing process. We have introduced a new step one, the step from the social context to the document, which means that what was step one before is now step two, and so on. In step two an interpretation takes place, and in step three a translation. However, we have to understand the new step one in order to carry out steps two and three. The indexing language will naturally reflect the social context which it serves, and it is therefore through the particular indexing language that the potential applications of the document will be expressed. Second, the words and concepts, and thereby the meanings, that are used by the author/document are conceived in the language game in which they participate. Therefore these games are the focus of the indexing theory. Understanding the language game is a prerequisite for predicting its informative potentials. Hartnack (1986) points out that philosophical problems, according to Philosophical Investigations, are often due to the mismatching of language games, and that it is the task of
philosophy to clarify which concepts belong to which game, and how they should be used. We believe that the object of an indexing theory is similar: it must facilitate the clarification of how the concepts and words of documents are used by assigning them to the proper language games. The document is an image, a frozen reflection, of a dynamic language game at a particular point in time, and the indexing will reflect this. This focus on the language game of the document does not conflict with the pursuit of helping users. If the users cannot understand the indexing terms assigned to the document and its language game, they probably won't be able to use the document itself. The indexing we advocate will therefore aid users in finding the proper documents. Furthermore, we believe that user guidance should be relegated to the supporting parts of the information system; it should not be a part of the indexing. Due to the intertextuality of documents, the language games of the author/documents must be considered more stable units than, for instance, the multitude of language games among the users. This stability is also brought about by the fact that a language game in which all statements are made through documents is a quite slow-paced one: the rules and meanings simply cannot change as fast as in a conversation. The focus on language games should not lead anyone to believe that we attribute no meaning to the document itself. This meaning lies in the fact that it is possible to make sense of the content of the document. It is this content that should be nuanced by representing the document's language game in the document representation. However, the meaning of a document in a theory of indexing lies with the notion that the latter is a theory of meaning; thus it must express knowledge of the conditions under which meaning is produced, distributed and exchanged. Since meaning must travel through a medium, this is also a theory of documents.
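The elaborated indexing model discussed above (social context → document → conceptual analysis → indexing language) can be sketched as a toy program. This is purely an illustrative sketch of ours, not an implementation proposed in the paper: all names (`LanguageGame`, `Document`, `conceptual_analysis`, `translate`) and the sample vocabulary are hypothetical. The point it illustrates is only that indexing terms are derived via the usage established in the document's parent language game, never from the document's words in isolation.

```python
# Illustrative toy model of the elaborated indexing process; all class and
# function names are hypothetical constructs, not from the paper.
from dataclasses import dataclass, field

@dataclass
class LanguageGame:
    """Step one: the social context (language game) of the document."""
    name: str
    # Established uses of words within this game: word -> concept.
    usage: dict = field(default_factory=dict)

@dataclass
class Document:
    text: str
    game: LanguageGame  # a document is always part of a language game

def conceptual_analysis(doc: Document) -> set:
    """Step two (interpretation): read the document's words against the
    established usage of its parent language game, not in isolation."""
    words = {w.strip('.,').lower() for w in doc.text.split()}
    return {doc.game.usage[w] for w in words if w in doc.game.usage}

def translate(concepts: set, indexing_language: dict) -> set:
    """Step three (translation): map concepts to the terms of the
    indexing language that serves the same social context."""
    return {indexing_language[c] for c in concepts if c in indexing_language}

# Toy usage: the same word would yield different concepts in another game.
lis = LanguageGame("LIS", usage={"indexing": "subject-representation",
                                 "game": "language-game"})
doc = Document("Indexing depends on the language game.", game=lis)
concepts = conceptual_analysis(doc)
terms = translate(concepts, {"subject-representation": "Indexing theory",
                             "language-game": "Wittgenstein, L."})
print(sorted(terms))  # ['Indexing theory', 'Wittgenstein, L.']
```

Swapping in a different `LanguageGame` with a different `usage` mapping would change the concepts, and hence the terms, assigned to the very same text, which is the model's central claim.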
That the language game of the author/document is the cognitive authority does not mean that one should proceed in the manner of classical hermeneutics, as represented by, for instance, Schleiermacher. The basic assumption of classical hermeneutics was, simply put, that sense should be made of a work by charting both the inner and outer life of the author, as well as his/her intention with the document. This is also referred to as the empathic theory (Pahuus, 1995, p. 112-113). Thereby classical hermeneutics committed what modern hermeneutics and literary criticism term the intentional fallacy: the failure to distinguish between an author’s intentions and motives for making the text (what Lancaster denotes "the ideas dealt with") and its meaning. When we consider the parent language game of the document as the cognitive authority, we are in fact focusing on the intertextuality of documents. By intertextuality we are referring to the intersubjective, textual context which the document is a part of, i.e. the language game. Thereby we also argue, in line with and as a consequence of the private language argument, against mentalistic approaches to indexing theory because, as Frohmann (1990) has pointed out, mentalism conceals intertextuality. So we return to the relationship between language game and indexing theory. The subject analysis of the document is not concerned with its inherent properties, although the text composition may be of importance to the indexing theory.
However, the purpose of this section has been to stress that a theory of indexing is a theory of meaning. Meaning is the prerequisite of communication, and the process of indexing is a communicative process. Furthermore, communication is a public enterprise. Since meaning is use, we believe that a theory of indexing is a theory of the social communication of meaning. Consequently, the idea of a private language is meaningless in connection with such a theory.
Conclusion
We assume that an indexing theory is a theory of meaning. An indexing theory based on Philosophical Investigations shifts the theoretical focus from the document to what constitutes the document: the language game. This does not mean that the document is meaningless in an indexing theory. A document can be made sense of; it can contribute to the creation and exchange of meaning, which constitutes the potential applications of a document. By conceptually representing the document's language game in the indexing, we are able to nuance this sense-making. The point of the document in an indexing theory is that the theory must express the conditions under which meaning, and thereby documents, is produced, distributed and exchanged. Therefore the document itself is not the primary concern of the theory, but rather what constitutes it. Using the concepts of language games and private language, exception is taken to mentalistic notions of language. Indexing theories that focus on the individual are considered meaningless because language is a social phenomenon. The individual user is of no importance to the theory in the sense that the perceptions of the individual cannot and should not be incorporated; to do so would violate the private language argument. Nor can the theory be based on the language games of groups of users: these games are too heterogeneous and dynamic to be useful. An indexing theory based on Philosophical Investigations must divide the universe of knowledge into a number of relatively stable language games, which must be indexed separately. This is where the language-philosophical aspect of the indexing theory is situated.
Notes
1. Here quoted from Fiske (1990).
2. At the end of August 1999, Wittgenstein was cited 77 times in the subject category “information science & library science” in the Social SciSearch citation database on DIALOG.
References
Bazerman, C. (1988). Shaping written knowledge: The genre and activity of the experimental article in science. Madison, Wisconsin: The University of Wisconsin Press.
Benediktsson, D. (1989). Hermeneutics: Dimensions towards LIS thinking. Library and Information Science Research, 11, 201-234.
Blair, D. C. (1990). Language and representation in information retrieval. Amsterdam: Elsevier.
Blair, D. C. (1992). Information retrieval and the philosophy of language. The Computer Journal, 35(3), 200-207.
Brier, S. (1996a). Cybersemiotics: A new paradigm in analyzing the problems of knowledge organization and document retrieval in information science. In: Peter Ingwersen & Niels Ole Pors (Eds.): Proceedings CoLIS 2: Second international conference on conceptions of library and information science: Integration in perspective. pp. 23-44. Copenhagen: Royal School of Librarianship.
Brier, S. (1996b). Cybersemiotics: A new interdisciplinary development applied to the problems of knowledge organization and document retrieval in information science. Journal of Documentation, 52(3), 296-344.
Buchanan, B. (1979). Theory of library classification. London: Clive Bingley. (Outlines of modern librarianship)
Christensen, M. L. (1994). Hermeneutik - fortolkning og forståelse. (Hermeneutics: interpretation and understanding) Biblioteksarbejde, 15(41), 25-40.
Chu, C. M. & O’Brien, A. (1993). Subject analysis: the first critical stages in indexing. Journal of Information Science, 19, 439-454.
Cornelius, I. (1996a). Information and interpretation. In: Peter Ingwersen & Niels Ole Pors (Eds.): Proceedings CoLIS 2: Second international conference on conceptions of library and information science: Integration in perspective. pp. 11-21. Copenhagen: Royal School of Librarianship.
Cornelius, I. (1996b). Meaning and method in information studies. Norwood, New Jersey: Ablex Publishing Corporation.
Fiske, J. (1990). Introduction to communication studies. 2nd ed. London: Routledge.
Frohmann, B. (1990). Rules of indexing: A critique of mentalism in information retrieval theory. Journal of Documentation, 46(2), 81-101.
Føllesdal, D., Walløe, L. & Elster, J. (1992). Politikens introduktion til moderne filosofi og videnskabsteori. (Introduction to modern philosophy and theory of science.) København: Politikens Forlag.
Hartnack, J. (1986). Wittgenstein and modern philosophy. Translation of: Wittgenstein og den moderne filosofi; translated by Maurice Cranston. Notre Dame, IN: University of Notre Dame Press.
Hjørland, B. (1992). The concept of 'subject' in information science. Journal of Documentation, 48(2), 172-200.
Washington, D.C., 31 October 1999
Andersen & Christensen
Hjørland, B. (1997). Subject representation and information seeking: An activity-theoretical approach to information science. Westport, CT: Greenwood Press.
Hjørland, B. (1998). Information retrieval, text composition and semantics. Knowledge Organization, 25(1/2), 16-31.
Hoel, I. (1992). Information science and hermeneutics: Should information science be interpreted as a historical and humanistic science? In: Pertti Vakkari & Blaise Cronin (Eds.), Conceptions of library and information science: Proceedings of the first CoLIS conference, Tampere, August 1991 (pp. 1-11). London: Taylor Graham.
Hutchins, J. W. (1978). The concept of "aboutness" in subject indexing. Aslib Proceedings, 30, 172-181.
Ingwersen, P. (1992). Information retrieval interaction. London: Taylor Graham.
Karamüftüoglu, M. (1996). Semiotics of documentary information retrieval systems. In: Peter Ingwersen & Niels Ole Pors (Eds.), Proceedings CoLIS 2: Second international conference on conceptions of library and information science: Integration in perspective (pp. 85-97). Copenhagen: Royal School of Librarianship.
Karamüftüoglu, M. (1997). Designing language games in OKAPI. Journal of Documentation, 53(1), 69-73.
Karamüftüoglu, M. (1998). Collaborative information retrieval: Toward a social informatics view of IR interaction. Journal of the American Society for Information Science, 49(12), 1070-1080.
Kuhlthau, C. C. (1993). Seeking meaning: A process approach to library and information services. Norwood, NJ: Ablex Publishing Corporation.
Lancaster, F. W. (1998). Indexing and abstracting in theory and practice. London: The Library Association.
Langridge, D. W. (1989). Subject analysis: Principles and procedures. London: Bowker-Saur.
Lübcke, P. (1996). Gadamer: Sandhed og metode. (Gadamer: Truth and method.) In: Poul Lübcke (Ed.), Vor tids filosofi: Engagement og forståelse (Philosophy of our time: Commitment and understanding) (pp. 163-177). 6. oplag. København: Politikens Forlag.
Mai, J.-E. (1998). Organization of knowledge: An interpretive approach. In: Elaine G. Toms, D. Grant Campbell & Judy Dunn (Eds.), Information science at the dawn of the next millennium: Proceedings of the 26th annual conference of the Canadian Association for Information Science (pp. 231-242). Toronto: Canadian Association for Information Science.
McLachlan, H. V. (1981). Buchanan, Locke and Wittgenstein on classification. Journal of Information Science, 3, 191-195.
Nedobity, W. (1989). Concepts versus meaning as reflected by the works of E. Wüster and L. Wittgenstein. International Classification, 16(1), 24-26.
Pahuus, M. (1995). Hermeneutik. (Hermeneutics.) In: Finn Collin & Simo Køppe (Eds.), Humanistisk videnskabsteori (Humanistic theory of science) (pp. 109-139). Danmarks Radio Forlaget.
Quinn, B. (1994). Recent theoretical approaches in classification and indexing. Knowledge Organization, 21(3), 140-147.
Shannon, C. & Weaver, W. (1949). The mathematical theory of communication. Illinois: University of Illinois Press.
Talja, S. (1997). Constituting "information" and "user" as research objects: A theory of knowledge formations as an alternative to the information man theory. In: Pertti Vakkari, Reijo Savolainen & Brenda Dervin (Eds.), Information seeking in context: Proceedings of an international conference on research in information needs, seeking and use in different contexts, 14-16 August, Tampere, Finland (pp. 67-80). London: Taylor Graham.
Tuominen, K. (1997). User-centered discourse: An analysis of the subject positions of the user and the librarian. Library Quarterly, 67(4), 350-371.
Warner, J. (1990). Semiotics, information science, documents and computers. Journal of Documentation, 46(1), 16-32.
Wilson, P. (1968). Two kinds of power: An essay on bibliographic control. Berkeley: University of California Press.
Wittgenstein, L. (1958). Philosophical investigations. New York: Macmillan Publishing. (Originally published in 1953.)
Wittgenstein, L. (1961). Tractatus logico-philosophicus. Translated by D. F. Pears & B. F. McGuinness. London: Routledge. (Originally published in 1922.)
Wittgenstein, L. (1969). On certainty. New York: Harper and Row.
Relevance Auras: Macro Patterns and Micro Scatter Terrence A. Brooks School of Library and Information Science University of Washington; Seattle, Washington, USA
Introduction
Empirical analysis of relevance assessments can illuminate how different groups of readers perceive the relationship between bibliographic records and index terms. This experiment harvested relevance assessments from two groups: engineering students (hereafter "engineers") and library school students ("librarians"). These groups assessed the relevance relationships between bibliographic records and index terms for three literatures: engineering, psychology and education. Assessment included the indexer-selected term (the topically relevant term) as well as broader, narrower and related terms. Figures 1-8 show these terms arranged as two-dimensional term domains. Positive relevance assessments plotted across the two-dimensional term domains revealed regular patterns, here called "relevance auras." A relevance aura is a penumbra of positive relevance, emanating from a bibliographic record across a term domain of broader, narrower and related index terms. This experiment attempted to compare the relevance auras produced by engineers and librarians at both a macro and a micro level of aggregation. Relevance auras appeared in data aggregating reader groups and literatures. Micro analyses of individual records, however, showed that relevance auras were ragged or did not develop. Agreement in relevance assessment appears on an individual-term basis, and often independently of the formation of a relevance aura.

Relevance assessment
Mizzaro's (1997) review of the history of relevance studies reveals both the centrality of the concept and how little progress has been made in its definition and quantification. Green considers the use of the term relevance to be in "disarray" (1995, p. 647), citing problems of both theoretical definition and operational measurement. This experiment premises that assessing the relevance of an index term for a bibliographic record is an act of reading that occurs within a cultural context.
Transactional reading theory suggests that no two readers will ever generate exactly the same meaning from a text because they bring different backgrounds to the act of
interpretation (Straw, 1990). Hjorland and Albrechtsen (1995) stress the effect of language communities on the interpretation of text in their domain-analytic paradigm. This experiment also assumes that harvesting and comparing relevance assessments reveals, in part, the fundamental phenomenon of verbal scatter. In short, people tend to disagree about the names of things (Furnas et al., 1983). Therefore it is not surprising to find online searchers disagreeing with the indexer's choice of terms. Farrow (1991) reminds us that there is no psychological basis for indexing. Thus, sources of variation in empirical experiments comparing relevance assessments include not only differences between language communities, but also disagreements with indexer-chosen terms. To search online engages one not only in a two-way "conversation" with a database system, but in a three-way conversation that includes a third speaker, the ghostly indexer. The indexer effect lingers as a confounding variable that interferes with the measurement of the differences between language communities. Standard thesauri and subject heading lists provide a starting point for the investigation of verbal scatter and relevance assessment. These tools conveniently arrange terms in relationships other than just topicality (Green, 1995). This experiment used narrower terms, broader terms and related terms, arranged in two-dimensional term domains. Mapping the positive relevance assessments onto a two-dimensional term domain creates a relevance aura.

The Semantic Distance Model
Brooks (1995, 1997, 1998) introduced the Semantic Distance Model (SDM) to explain consistent relationships, observed in a series of experiments, between relevance assessments and semantic distance. The semantic distance effect of the SDM suggests that relevance assessments decline systematically with greater semantic distance.
The semantic direction effect of the SDM suggests that the distance to nonrelevance depends on the direction of assessment up or down a term hierarchy. Semantic distance is a psychological construct that has been used to locate concepts along various dimensions of meaning (Schvaneveldt, Durso & Mukherji, 1982). The following experiment found broader, narrower and related term relationships in the INSPEC Thesaurus, the Thesaurus of ERIC Descriptors, and the Thesaurus of Psychological Index Terms. Each source provided two term hierarchies five terms deep. The top term in each hierarchy located a bibliographic record (here called a "top" record). Similarly, the bottom term in the hierarchy located a "bottom" record. The related terms in the term domains were chosen in an arbitrary fashion. The first related term was simply the first listed related term of each term in the vertical hierarchy. In turn, the second related term was the first listed related term for the first related term. The third and fourth related terms were chosen in a similarly arbitrary fashion. Figures 1-8 show the two-dimensional term domains of the INSPEC
Thesaurus and the Thesaurus of ERIC Descriptors. Each table exhibits either the top or bottom bibliographic record, the term hierarchies (in the left-most column), and the related terms that fill out the two-dimensional term domain. A computer program collected relevance assessments by randomly presenting each bibliographic record paired with each of the twenty associated index terms in each table. Averaging relevance assessments produced either a positive or negative mean relevance assessment. Relevance auras were constructed by mapping positive mean relevance assessments over the two-dimensional term domains.

Research questions
The semantic distance effect of the SDM predicts that relevance assessment will decline with semantic distance. This effect will be demonstrated if the indexer-chosen "topical" index term is assessed most highly, while more distant terms are systematically devalued. The semantic direction effect of the SDM predicts that the distance downward to nonrelevance is shorter than the distance upward to nonrelevance. This effect will be demonstrated by comparing the distance to nonrelevance for top and bottom records. The distance to nonrelevance (here defined as negative mean relevance assessment) for top records should be shorter than the distance to nonrelevance for bottom records. This experiment hypothesizes that relevance auras will emanate from a bibliographic record in a consistent pattern. The aura of a top record should extend sideways and downwards in an arc across a two-dimensional term domain. The aura of a bottom record should extend sideways and upwards in an arc across a two-dimensional term domain. The research premise is that consistent differences in the relevance auras between librarians and engineers are evidence of differences in their discourse communities.

Subjects
University students were a readily available pool of experimental subjects.
The librarian group was twenty-eight randomly chosen master's-degree students of a school of library and information science. The engineer group was twenty-eight randomly chosen mechanical engineering students (eight doctoral students, eight master's-degree students, and twelve bachelor's-degree students). This experiment assumed that these students, primarily graduate students, represented different discourse communities.

Experimental equipment and procedures
A computer program presented a bibliographic record and a randomly chosen index term. Subjects expressed their relevance assessments by moving a light bar over an unmarked scale. Subjects evaluated either the top or bottom bibliographic record, but not both, for any given term domain. Relevance assessments were normalized to z scores and aggregated. Mean z scores were found for each index term in a two-dimensional term domain. Arbitrarily, positive mean z scores were defined as positive relevance assessment, and negative mean z scores were defined as negative relevance assessment. Evidence of the SDM was found by comparing mean relevance assessments. Relevance auras were created by mapping the pattern of positive mean relevance assessments onto the two-dimensional term domains.
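The normalization and aura-mapping procedure described above can be sketched roughly as follows (a minimal illustration, not the experiment's software; the function names are assumptions, and the example values are taken from the aggregate assessments reported in Table 1):

```python
import statistics

def to_z_scores(raw):
    """Normalize a list of raw relevance assessments to z scores."""
    mean = statistics.mean(raw)
    sd = statistics.pstdev(raw)
    return [(x - mean) / sd for x in raw]

def relevance_aura(mean_z_by_term):
    """Positive mean z score => term lies inside the aura; negative => outside."""
    return {term: z > 0 for term, z in mean_z_by_term.items()}

# Illustrative mean z scores for a tiny term domain
mean_z = {"topical term": 0.97, "narrower term": 0.22,
          "related term 1": 0.25, "related term 2": -0.07}
aura = relevance_aura(mean_z)
```
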
Results
The Semantic Distance Model
Table 1 presents mean relevance assessments for the vertical dimension of the term hierarchies (the left-most column in each table). Beginning with the indexer-chosen term (0.97), the mean relevance assessments decline systematically (0.22, 0.17, 0.12). The aggregate data appear to display the semantic distance effect predicted by the SDM. Table 1 also aggregates the related-term assessments for the top and bottom records of each table (this is the horizontal dimension extending from each top and bottom term). Beginning with the indexer-chosen term (0.97), the mean relevance assessments decline systematically (0.25, -0.07, -0.09). The presence of the semantic distance effect in both vertical and horizontal dimensions suggests the existence of well-formed relevance auras for aggregate data.

The relevance aura of three literatures
Relevance assessments from three literatures afforded the greatest level of abstraction, and therefore the greatest possibility of observing well-formed relevance auras.

Top records
Table 2 displays the relevance aura of the top records of the education, engineering and psychology literatures. Relevance assessments decline systematically with semantic distance in both the horizontal and vertical dimensions. The relevance aura, defined as the portion of the table with positive relevance assessments, is limited to just the vertical and horizontal dimensions. None of the internal descriptors were considered relevant. Thus, a well-formed relevance aura for top records did not appear.

Bottom records
Table 3 displays the relevance aura of bottom records for three literatures. Positive relevance assessments extend upward four semantic steps and one step horizontally. Positive relevance assessments fan outwards with some scatter. Bottom records for the aggregated data form a wide, but inconsistent, relevance aura.
The semantic direction effect of the SDM predicts that the distance downward to nonrelevance is shorter than the distance upward to nonrelevance. Table 2 illustrates that nonrelevance is reached two semantic steps below top records; Table 3 illustrates that terms four semantic steps above bottom records are assessed as relevant. Therefore, as predicted, the distance downward to nonrelevance is shorter than the distance upward.

Micro analysis of top records
Engineering top records
Figure 1 presents the relevance auras given by engineers and librarians for INSPEC record 5279409. There is great similarity in the assessments by engineers and librarians for this record. Presumably both groups were influenced by the word "assembly" in the first sentence of the abstract. The librarians rated "Integrated Circuit Technology" positively, but this is not statistically different (p < .18) from the negative rating given this term by the engineers. Figure 2 presents the relevance auras given by engineers and librarians for INSPEC record 00916781. The relevance aura colors many cells in this two-dimensional term domain, although the symmetry is disturbed because both groups consider "Fission Reactors" negative. The two groups differ in their assessment of "Atomic Beams", but the difference is not statistically significant (p < .27). There appears to be great overlap between librarians and engineers in their relevance assessments for these engineering top records, but the relevance assessments form ragged relevance auras.

Education top records
Figure 3 presents the relevance auras for ERIC record EJ516093. Both groups scattered positive relevance assessments throughout the tables. The five terms that produced disagreement were: Generative Grammar (p < .23), Anthropological Linguistics (p < .18), Syntax (p < .17), Cultural Context (p < .06) and Form Classes (Languages) (p < .06). Only the last two approach statistical significance.
Figure 4 presents the relevance auras for ERIC record EJ515482. Here the librarians disagree with both the engineers and the indexers by assessing all terms as negative. The two terms separating the engineers and librarians are not statistically significant: Logic (p < .40) and Logical Thinking (p < .24). It appears that the engineering top records produced more agreement between librarians and engineers than the education top records. The education records illustrated not only disagreement between the two groups, but also disagreement between librarians and indexers.
Micro analysis of bottom records
Engineering bottom records
Figure 5 presents the relevance auras for INSPEC record 5285612. Both groups scatter positive relevance assessments throughout the tables with complete agreement. Both disagree with the indexer by rating the topically relevant term negatively. Figure 6 presents the relevance auras for INSPEC record 5282573. This record casts a broad relevance aura over many of the terms for both groups. Differences between the two groups are centered on only two terms: Fission (p < .53) and Charge Exchange (p < .05). The difference created by the last term attains statistical significance and implies a real difference between the groups in what "Charge Exchange" may mean.

Education bottom records
Figure 7 presents the relevance auras for ERIC record ED207581. These auras tend to cluster around the topical term, with the exception of Transformational Generative Grammar (p < .01). Linguistic Theory (p

O(x1, ..., xn) = 1 if w1*x1 + ... + wn*xn > θ, and 0 otherwise

where each wi is a weight that determines the contribution of the corresponding input value xi, and θ is a threshold that the weighted combination of inputs must surpass to set the output to 1. The learning process in a perceptron involves choosing the best values of wi and θ based on the set of training examples.
Geometrically speaking, in two dimensions, the formula of the perceptron represents a line. Any point above the line makes the output of the perceptron 1. Perceptrons can represent many primitive Boolean functions, but they have the limitation that they can learn only linearly separable problems. In Figure 1, two classes (+ and -) are represented in a two-dimensional space. Figure 1(a) shows an example of linearly separable categories, while Figure 1(b) shows an example of non-linearly separable
Ruiz & Srinivasan
categories (observe that there is no way to separate the two categories using a single line).
Figure 1. (a) Linearly separable categories (b) Non linearly separable categories
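The linear-separability limitation just described can be illustrated with a small sketch (a hypothetical training routine and data set, not the paper's implementation): the classic perceptron learning rule finds a separating line for the linearly separable AND function, whereas no choice of weights and threshold can reproduce XOR.

```python
def train_perceptron(examples, epochs=20, lr=0.1):
    """Learn weights w and threshold theta; output is 1 iff sum(w_i * x_i) > theta."""
    w = [0.0, 0.0]
    theta = 0.0
    for _ in range(epochs):
        for (x1, x2), target in examples:
            out = 1 if w[0] * x1 + w[1] * x2 > theta else 0
            err = target - out
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            theta -= lr * err  # the threshold moves opposite to the weights
    return w, theta

def predict(w, theta, x):
    return 1 if w[0] * x[0] + w[1] * x[1] > theta else 0

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, theta = train_perceptron(AND)
```

Running the same routine on the XOR examples never converges, which is exactly the situation depicted in Figure 1(b).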
Figure 2. Multi-layered neural network

Backpropagation Networks
In contrast to the perceptron model presented previously, multi-layered networks have their neurons organized in layers, and each neuron is fully connected to the neurons in the next layer (see Figure 2). This configuration allows multi-layered networks to approximate almost any vector-valued function. Backpropagation networks are multi-layered networks in which the weights are learned using a gradient descent algorithm. Given a network with a fixed set of units and interconnections, the backpropagation algorithm employs a gradient descent strategy to find the set of weights that minimizes the squared error between the network output values and the correct values. The reader is referred to chapter 4 of (Mitchell, 1998) for more details about ANNs and the backpropagation algorithm.
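A minimal backpropagation sketch for a 2-2-1 sigmoid network follows (a pure-Python illustration with an assumed learning rate, initialization and stopping rule; it is not the paper's implementation, which is described later):

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_net(data, hidden=2, lr=0.5, epochs=5000, seed=0):
    """Gradient descent on squared error for a 2-input, 1-output network."""
    rnd = random.Random(seed)
    # w[i][j]: input i -> hidden j; v[j]: hidden j -> output; bh, bo: biases
    w = [[rnd.uniform(-1, 1) for _ in range(hidden)] for _ in range(2)]
    bh = [rnd.uniform(-1, 1) for _ in range(hidden)]
    v = [rnd.uniform(-1, 1) for _ in range(hidden)]
    bo = rnd.uniform(-1, 1)
    for _ in range(epochs):
        for (x1, x2), t in data:
            h = [sigmoid(x1 * w[0][j] + x2 * w[1][j] + bh[j]) for j in range(hidden)]
            o = sigmoid(sum(v[j] * h[j] for j in range(hidden)) + bo)
            # backpropagate the error: delta terms for the output and hidden units
            do = (t - o) * o * (1 - o)
            dh = [do * v[j] * h[j] * (1 - h[j]) for j in range(hidden)]
            for j in range(hidden):
                v[j] += lr * do * h[j]
                w[0][j] += lr * dh[j] * x1
                w[1][j] += lr * dh[j] * x2
                bh[j] += lr * dh[j]
            bo += lr * do
    def forward(x1, x2):
        h = [sigmoid(x1 * w[0][j] + x2 * w[1][j] + bh[j]) for j in range(hidden)]
        return sigmoid(sum(v[j] * h[j] for j in range(hidden)) + bo)
    return forward

AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
net = train_net(AND_DATA)
```
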
Figure 3. Training procedure for Text Categorization using a machine learning algorithm
Machine Learning and Text Categorization
So far we have discussed the generalities of machine learning and, in particular, some details of neural networks. This section presents the general procedure for training a machine learning algorithm for text categorization. As mentioned before, text categorization can be characterized as a supervised learning problem. We have a set of examples (abstracts or full-text documents) that have been correctly categorized (usually by human indexers). Given this set of training examples, we can train a machine learning method. Figure 3 shows the general procedure for training a machine learning algorithm to perform text categorization.

Document Representation: At this point we assume that the documents are electronically available (title and abstract, or full text). We use the vector space model (Salton, 1983) to represent a document. First, all the words in the document are tokenized, filtered using a stop list (words such as articles, prepositions, common verbs, etc. are discarded), and stemmed using Porter's stemmer (Porter, 1980). Unique stems with their corresponding document frequencies are kept. Each document is then represented by a vector:
D = (x1, ..., xm)
xi = tf * idf
idf = log(N / ni)
where m is the total number of unique stems (terms) in the training collection, tf is the term frequency within the document, idf is the inverse document frequency, N is the total number of documents in the training set, and ni is the number of documents in the training set that contain term i. The vector space model allows us to represent a document as a real-valued vector that can be presented as an input to a neural network, or to any other machine learning algorithm.

Feature Selection: Despite using a stop list and stemming, which are techniques that reduce the number of terms, the size of the document vector is usually too large to be useful for training a machine learning algorithm. Performance and training time of many machine learning algorithms are closely related to the quality of the features used to represent the problem. In previous work (Ruiz and Srinivasan, 1998), we used a frequency-based method to reduce the number of terms. The number of terms, or features, is an important factor that affects the convergence and training time of most machine learning algorithms. For this reason it is important to reduce the set of terms to an optimal subset that achieves the best performance. Two approaches to feature selection have been presented in the literature: the filter approach and the wrapper approach (Liu & Motoda, 1998). The wrapper approach attempts to identify the best feature subset to use with a particular algorithm. For example, for a neural network the wrapper approach selects an initial subset and measures the performance of the network; then it generates an "improved" set of features and measures the performance of the network using this set. This process is repeated until it reaches a termination condition (either the improvement is below a predetermined value or the process has been repeated for a predefined number of iterations). The final set of features is then selected as the "best set".
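The document representation described earlier (tokenize, filter against a stop list, then weight by tf * idf) can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: the stop list is a tiny hard-coded stand-in and Porter stemming is omitted.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}  # toy stop list

def tokenize(text):
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tfidf_vector(doc, corpus):
    """Weight each term in doc by tf * idf, with idf = log(N / n_i)."""
    n_docs = len(corpus)
    tf = Counter(tokenize(doc))
    df = Counter()
    for d in corpus:
        df.update(set(tokenize(d)))
    return {t: f * math.log(n_docs / df[t]) for t, f in tf.items()}

corpus = ["heart disease study", "heart failure", "grammar study"]
vec = tfidf_vector(corpus[0], corpus)
```

A term appearing in every document gets idf = log(1) = 0 and thus contributes nothing, which matches the intuition behind inverse document frequency.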
The filter approach, which is more commonly used, attempts to assess the merits of the feature set from the data alone, independently of the particular learning algorithm. The filter approach selects a set of features using a ranking criterion based on the training data. Once the feature set for the training set has been identified, the training process takes place by presenting each example (represented by its set of features) and letting the algorithm adjust its internal representation of the knowledge contained in the training set. After a pass over the whole training set, which is called an epoch, the algorithm
checks whether it has reached its learning goal. Some algorithms, such as Bayesian learning algorithms, need only a single epoch; others, such as neural networks, need multiple epochs to converge. The trained classifier is then ready to be used for categorizing a new document. The classifier is typically tested on a set of documents that is distinct from the training set. Since we know the correct classification for this test set, we can determine the effectiveness of the classifier.
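As a toy illustration of the filter approach described above, one can rank terms by a criterion computed from the training data alone. Plain document frequency is used here as a stand-in for the correlation-coefficient measure adopted later in this paper; the function name is an assumption, not the authors' code.

```python
from collections import Counter

def select_features(docs, k):
    """Filter approach: rank terms by document frequency and keep the top k."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return [term for term, _ in df.most_common(k)]

docs = [["heart", "valve"], ["valve", "disease"], ["valve", "grammar"]]
top = select_features(docs, 1)
```
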
Hierarchical Model
Up to this point we have shown a general procedure for training and testing a machine learning algorithm for text categorization. This procedure does not take into account the hierarchical structure of the indexing vocabulary. Vocabularies such as MeSH have associated relations that organize them in a hierarchical structure, i.e. a parent-child relation or a narrower-term relation. These relations are built into the vocabulary to facilitate its organization and to help indexers. They usually reflect conceptual relationships in the domain. Except for a few works, most researchers in automatic text categorization have ignored these relations. We believe that since the arrangement of terms in a hierarchical tree reflects the conceptual structure of the domain, machine learning algorithms could take advantage of it and improve their performance. Indexing a document is a task wherein multiple categories are assigned to a single document. Although human indexers are effective at this, it is quite challenging for a machine learning algorithm. Some algorithms even make the simplifying assumptions that the categorization task is binary and that a document cannot belong to more than one category. For example, the naive Bayesian learning approach assumes that a document belongs to a single category. This problem can be solved by building a single classifier for each category, such that the learning algorithm learns to recognize whether or not a particular term (category) should be assigned to a document. This transforms a multiple category assignment problem into a multiple binary decision problem. We extend machine learning to consider the hierarchical structure of the indexing vocabulary by using a method inspired by the Hierarchical Mixture of Experts (HME) model (Jordan & Jacobs, 1993). HME is based on the divide-and-conquer principle.
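The transformation described above, from multiple category assignment into one binary decision per category, can be sketched as follows (a hypothetical classifier interface with a trivial memorizing trainer standing in for a real learner; not the paper's code):

```python
def train_binary_classifiers(train_fn, examples, categories):
    """One binary classifier per category: each learns 'assign this term or not'."""
    classifiers = {}
    for cat in categories:
        labeled = [(doc, cat in doc_cats) for doc, doc_cats in examples]
        classifiers[cat] = train_fn(labeled)
    return classifiers

def categorize(classifiers, doc):
    """A document may receive several categories, one yes/no decision each."""
    return {cat for cat, clf in classifiers.items() if clf(doc)}

def memorizing_trainer(labeled):
    # Stand-in learner: remembers the positive training documents
    positives = {doc for doc, lab in labeled if lab}
    return lambda doc: doc in positives

examples = [("doc1", {"Heart Diseases", "Endocarditis"}), ("doc2", {"Endocarditis"})]
clfs = train_binary_classifiers(memorizing_trainer, examples,
                                {"Heart Diseases", "Endocarditis"})
```
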
The main idea is to solve the problem by dividing it into smaller problems that are easier to solve, and then combining the solutions of the small problems to obtain the general solution. This principle has been successfully used to create algorithms that are elegant and efficient. The HME model approaches the solution of a classification problem by dividing the input space into a nested sequence of regions and then
training smaller classifiers that are specialized in classifying a reduced domain. The HME model has two basic components: gating networks and expert networks. These components are organized as a tree structure where the internal nodes are the gates and the leaf nodes are the experts. The gating networks represent high-level concepts. As their name implies, each gating network works as a gate to access nodes in the lower levels of the hierarchy. The expert networks, situated in the terminal nodes of the hierarchy, are specialized in recognizing documents corresponding to specific categories. The original HME model proposed by Jordan and Jacobs (1993) uses expectation maximization to train the gates, and linear perceptrons as experts. It works in a bottom-up fashion. Initially all experts are activated and their output is combined by the parent gate. The output of the gate is then passed to the next upper level until it reaches the top level, which reports the global answer. Figure 4 shows a schematic view of our model for a two-level hierarchy. We use the same division of the space as the HME model, using experts and gates, but our training procedure is different since we don't use expectation maximization. Another difference is that in our model each gate receives an input x, which is the document vector; if the document contains the concept represented by that node, the output of the network is set to 1 (true), else 0 (false). All the networks connected to an "open gate" (a gate whose output value is 1) are activated, and thus the classification proceeds in a top-down fashion. If a document reaches an expert node via the high-level gates, then the expert decides whether or not the category that the expert "knows" should be assigned to the document.
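The top-down activation just described can be sketched as a tree walk (a toy illustration in which hand-written gate and expert functions stand in for the trained neural networks):

```python
class Node:
    def __init__(self, gate, expert=None, children=()):
        self.gate = gate          # callable: doc -> True/False (is the gate open?)
        self.expert = expert      # callable: doc -> category name or None
        self.children = children

def assign_categories(node, x):
    """Top-down routing: only subtrees behind open gates are activated."""
    if not node.gate(x):
        return set()
    cats = set()
    if node.expert is not None:
        cat = node.expert(x)
        if cat is not None:
            cats.add(cat)
    for child in node.children:
        cats |= assign_categories(child, x)
    return cats

# Toy two-level hierarchy: a root concept with one specialized leaf
leaf = Node(gate=lambda x: "valve" in x,
            expert=lambda x: "Heart Valve Diseases" if "valve" in x else None)
root = Node(gate=lambda x: "heart" in x,
            expert=lambda x: "Heart Diseases",
            children=(leaf,))
```

If the root gate stays closed, no expert below it is ever consulted, which is the computational saving of the top-down scheme.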
Figure 4. Hierarchical model

The categorization task starts at the root node, where the gate decides whether the most general concept is present in the document. If it is, then all the second-level nodes are activated, and the process repeats until a leaf node is reached. Observe that only the experts connected to gates whose value is 1 are activated.

Implementation of the Hierarchical Model
For this study we use the Medical Subject Headings (MeSH) subset of the Unified Medical Language System (UMLS) Metathesaurus as our indexing vocabulary, and the parent-child relationship as defining the hierarchy of concepts. The UMLS Metathesaurus is made up of concepts. Each concept has one or more associated strings, of which one is the preferred string (usually the MeSH term). The UMLS Metathesaurus offers hierarchical relationships in which more general concepts are parents of narrower or more specific concepts (see Figure 5). In our model the gates represent the general concepts, while the experts represent specific categories (or strings). For the purpose of comparing results with other studies we show the results obtained using only the subtree of Heart Diseases, but the method is general and can be applied to the whole set or any subset of UMLS. We use backpropagation neural networks for both experts and gates. The experts are trained using a selected
subset of representative examples, discussed in more detail later, to recognize the presence or absence of a particular category in a document. The gates are trained to recognize whether or not any of the descendant categories is present in the document. Observe that this definition implies that each category corresponding to a high-level node has both a gate network and an expert network. For example, for the category "Heart Diseases", which is the root of our tree, there is a gate network that learns to recognize the general concept (all the documents that are about any of its descendants) and an expert network that learns how to assign the specific category "Heart Diseases".
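The way gate training labels are derived can be sketched as follows (our own construction, not the authors' code): a gate's positive examples are the documents labeled with any category in its subtree, while an expert's positives are the documents labeled with its specific category.

```python
# Hypothetical sketch: derive a gate's positive training examples from
# the manual category assignments. All names are ours.

def gate_positives(subtree_categories, doc_labels):
    """doc_labels: {doc_id: set of assigned categories}. Return the ids of
    documents whose labels intersect the categories in the gate's subtree."""
    return {d for d, labels in doc_labels.items()
            if labels & subtree_categories}

docs = {1: {"Tachycardia"}, 2: {"Endocarditis"}, 3: {"Asthma"}}
subtree = {"Heart Diseases", "Arrhythmia", "Tachycardia", "Endocarditis"}
print(sorted(gate_positives(subtree, docs)))  # [1, 2]
```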
Figure 5. UMLS hierarchical structure for the "Heart Diseases" subtree

The backpropagation networks that we use have three layers. The input layer consists of a set of n features selected for each expert (or gate), the middle layer has 2n hidden nodes, and the output layer is a single node. Given the best set of features and a training set, the backpropagation network learns to assign the category (or concept, in the case of gates).

Feature selection: As discussed before, feature selection is a critical step for automatic text categorization. Neural networks in particular require reduction of the feature set because the performance of the network and the cost of classification are sensitive to the size and quality of the input features used to train the network (Yang & Honovar,
1998). For this paper we report our results using a filter approach based on a measure called the Correlation Coefficient (Ng, Goh, & Low, 1997). The measure relates the occurrence of a term (in our case a stemmed word) in the relevant and the non-relevant documents for each category. The formula is as follows:

C(w) = (Nr+ Nn− − Nr− Nn+) √N / √( (Nr+ + Nr−)(Nn+ + Nn−)(Nr+ + Nn+)(Nr− + Nn−) )
where Nr+ (Nn+) is the number of relevant (non-relevant) texts in which the term w occurs, and Nr− (Nn−) is the number of relevant (non-relevant) texts in which the term w does not occur. This measure is derived from the χ² measure presented by Schütze et al. (1995), where C² = χ². The correlation coefficient can be interpreted as a "one-sided" χ² measure. This method promotes features that have high frequency in the relevant documents but are rare in the non-relevant documents. When features are ranked by this method, positive values correspond to features that indicate presence of the category, while negative values indicate absence of the category. According to empirical results from Ng, Goh & Low (1997), using the features from relevant documents that are indicative of membership in the category gives better results than using features that are indicative of non-membership. We tested this hypothesis for a few categories and our results also support it.

Training set selection

When working with large collections we have a considerable number of categorization examples. Theoretically, the availability of a large number of examples seems beneficial for training a machine learning algorithm. However, when the number of possible categories is also considerably large (such as the UMLS Metathesaurus with 350,000 concepts), each category is assigned to only a small number of documents. Each category then has a relatively small number of positive examples and an overwhelming number of negative examples (everything that has not been labeled with this category). When a machine learning algorithm is trained with such an unbalanced training set, it will learn that the safest decision is not to assign the category, because this is correct most of the time. The overwhelming number of negative examples hides the assignment function. To overcome this problem, an appropriate training set must be selected.
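As a sanity check of the correlation-coefficient formula given earlier, here is a direct pure-Python transcription (our own illustration; the variable names nrp, nrm, nnp, nnm stand for Nr+, Nr−, Nn+, Nn−):

```python
import math

def correlation_coefficient(nrp, nrm, nnp, nnm):
    """C(w) for a term w from the four contingency counts.
    nrp/nrm: relevant docs with/without w; nnp/nnm: non-relevant with/without w."""
    n = nrp + nrm + nnp + nnm
    num = (nrp * nnm - nrm * nnp) * math.sqrt(n)
    den = math.sqrt((nrp + nrm) * (nnp + nnm) * (nrp + nnp) * (nrm + nnm))
    return num / den if den else 0.0

# A term occurring in every relevant document and no non-relevant document
# gets the maximum score sqrt(N), since C^2 equals chi-square (whose maximum is N).
print(correlation_coefficient(10, 0, 0, 10))  # sqrt(20) ~ 4.472
```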
We call this training subset the “category zone”. This concept is inspired by the query zone, which was proposed by Singhal, Mitra & Buckley (1997). They describe a method called “query zoning” which is based on the observation that in a large collection a query will have a set of documents that constitute its domain. Non-relevant documents that are outside the domain are easy to identify, but it is more difficult to differentiate between relevant and non-relevant documents in the query domain. We observe that in text categorization, each category also has its own domain and that training a machine learning algorithm with examples in this domain could improve the effectiveness of a specialized classifier. The problem is that the
category zone is not explicitly known. We use the following procedure to obtain an approximation of the category zone:

1. Take all positive examples and compute their centroid vector. Positive examples are those documents that have been assigned the category.
2. Using this centroid as a query, perform retrieval and obtain the top 10,000 documents. This subset will contain most of the positive examples and many negative examples that are "closely related" to the domain of the category.
3. Obtain the category zone by adding any unretrieved positive examples to the set obtained in the previous step.

The use of this category zone as the training set for a specific category will speed up the training process and will also improve the effectiveness of the classifier.
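The three steps above can be sketched as follows (our own pure-Python illustration; cosine similarity stands in for the SMART retrieval run, and all names are hypothetical):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def category_zone(doc_vectors, positive_ids, k=10_000):
    """doc_vectors: {doc_id: vector}; positive_ids: docs assigned the category."""
    # 1. Centroid of the positive examples.
    pos = [doc_vectors[d] for d in positive_ids]
    centroid = [sum(col) / len(pos) for col in zip(*pos)]
    # 2. Retrieve the top-k documents using the centroid as the query.
    ranked = sorted(doc_vectors,
                    key=lambda d: cosine(doc_vectors[d], centroid),
                    reverse=True)
    zone = set(ranked[:k])
    # 3. Add back any positive examples that were not retrieved.
    return zone | set(positive_ids)

docs = {1: [1.0, 0.0], 2: [0.9, 0.1], 3: [0.0, 1.0]}
print(sorted(category_zone(docs, {1}, k=2)))  # [1, 2]
```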
Experimental Setting

Our experiments use the OHSUMED collection (Hersh et al., 1994), a subset of the MEDLINE collection that contains 348,543 records from 1987 to 1991. Each record in this collection has several fields; we use the title and abstract fields. In the training set we use the MeSH field, which represents the manual indexing performed by professional indexers at the National Library of Medicine (NLM). From this collection we select the subset of 233,455 records that have title, abstract, and MeSH terms. We select the first four years for training (183,229 records) and the records from year 1991 (50,216 records) for testing. The subtree of Heart Diseases, which is the one we explore in this study, contains 119 categories. These experimental settings are the same as those used by Lewis et al. (1996) and allow us to compare our results with their method. Lewis et al. (1996) focused their study on those categories that have at least 75 examples in the training collection, obtaining a subset of 49 categories. In our previous work (Ruiz & Srinivasan, 1998), where we studied the minimum number of positive examples needed to properly train a neural network, we found that we should have at least 20 positive examples in the training set. There are 71 categories in the "Heart Diseases" subtree that have at least 20 examples. For purposes of comparison we report only the subset of 49 categories used by Lewis et al. (1996). The 119 "Heart Diseases" categories form a 5-level hierarchy where the first level corresponds to the root node and the fifth level has only leaf nodes. The number of gates in each level, starting from the root, is 1, 11, 9, and 3. Not all branches have experts with the required number of examples; as a consequence, the hierarchy for the
49 categories that have at least 75 examples is reduced to three levels with 1, 6, and 4 gates respectively (see Figure 6).
Figure 6. Hierarchy for the 49 categories that have at least 75 examples.
Evaluation Measurements

Several measurements have been used in previous studies to evaluate the performance of text categorization systems. The contingency table relates the system's assignments and the human indexer's assignments (see Table 1). Several measures have been defined in the artificial intelligence and information retrieval communities based on this contingency table (see Table 2). Fβ and the Break-Even Point (BEP) are two common measures in text categorization that combine recall and precision. BEP was proposed by Lewis (1992) and is defined as the point at which recall equals precision. Van Rijsbergen's Fβ measure (van Rijsbergen, 1979) combines recall and precision into a single score according to the following formula:

Fβ = (β² + 1) P × R / (β² P + R)

The value β ∈ [0, ∞) defines the relative importance of recall and precision; F0 is the same as precision, and F∞ is the same as recall. Intermediate values of β assign different weights to recall and precision. The most common values assigned
to β are 0.5 (recall half as important as precision), 1.0 (recall and precision equally important), and 2.0 (recall twice as important as precision). None of these measures is perfect or appropriate for every problem. For example, recall (sensitivity) and precision (predictive value (+)), if used alone, can give misleading results: a system that assigns the category to every document will show perfect recall (1.0). Accuracy, which is a popular measure in machine learning, is appropriate when the numbers of positive and negative examples are balanced, but in extreme conditions it can also mislead: if the number of negative examples is much larger than the number of positive examples, a system that assigns no documents to the category will obtain almost perfect accuracy (close to 1). BEP also has problems. Usually the value of BEP has to be interpolated, and if the values of recall and precision are too far apart, BEP will show values that are not achievable by the system. Also, the point where recall equals precision is not necessarily desirable from the user's perspective. Van Rijsbergen's F measure is the best suited measure, but it still has the drawback that it may be difficult for the user to define the relative importance of recall and precision. We report F1 values because this allows us to compare results with other researchers who use the same data set (Lewis et al., 1996). When working with several categories we report a macro-averaged value.
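A small sketch (our own, not the authors' code) of the Fβ computation from the contingency counts a, b, c of Table 1, matching the limiting cases just described:

```python
def f_beta(a, b, c, beta=1.0):
    """F-beta from true positives (a), false positives (b), false negatives (c)."""
    if a == 0:
        return 0.0
    precision = a / (a + b)
    recall = a / (a + c)
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

print(f_beta(5, 5, 5))          # F1 when P = R = 0.5 -> 0.5
print(f_beta(5, 5, 0, beta=0))  # beta = 0 reduces to precision -> 0.5
```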
                                          System assigns the        System does not assign the
                                          category                  category
                                          (assigned positive, R+)   (assigned negative, R−)
Indexer assigns the category
(class positive, C+)                      a (True Positive)         c (False Negative)
Indexer does not assign the category
(class negative, C−)                      b (False Positive)        d (True Negative)

Table 1. Contingency table for binary decision classifications.
Table 2. Efficiency measures for binary classification defined in Information Retrieval (IR) and Artificial Intelligence (AI).

IR              AI                    Formula
Recall (R)      Sensitivity           a / (a + c)
Precision (P)   Predictive_value(+)   a / (a + b)
Fallout                               b / (b + d)
                Predictive_value(−)   d / (c + d)
                Accuracy              (a + d) / (a + b + c + d)
                Error_rate            (b + c) / (a + b + c + d)
Experiments

We want to address two questions with our experiments: (1) Does a classifier based on our hierarchical model improve performance when compared to a non-hierarchical classifier? (2) How does our hierarchical method compare with other text categorization approaches? With these questions in mind we present a series of experiments using the OHSUMED collection.

Baselines

We define two baselines to answer our research questions. The first baseline is a classical Rocchio classifier, described next; this allows us to compare our hierarchical model against another approach. The second baseline is a "flat" neural network, which addresses our first research question. We also cite work published on the same collection to answer the second research question.

Rocchio Classifier

Rocchio's algorithm was developed in the mid-1960s to improve queries using relevance feedback, and it has proven to be one of the most successful relevance feedback algorithms. Rocchio (1971) showed that the optimal query vector is the difference of the centroid vectors for the relevant and the non-relevant documents. Salton & Buckley (1990) included the original query to preserve its focus, and added coefficients to control the contribution of each component. The mathematical expression of this version is:
Qnew = α Qorig + β (1/R) ∑{d ∈ rel} d − γ (1/(N−R)) ∑{d ∉ rel} d
where d is a weighted document vector, R = |Rel| is the number of relevant documents, and N is the total number of documents. Any negative component of the final vector Qnew is set to zero. Several techniques have been proposed to improve the effectiveness of Rocchio's method: better weighting schemes (Singhal, Buckley & Mitra, 1996), query zoning (Singhal, Mitra & Buckley, 1997), and dynamic feedback optimization (Buckley & Salton, 1995). We use the first two of these techniques to get an "optimal" Rocchio classifier for text categorization, following the same approach used by Schapire, Singer & Singhal (1998) to build their baseline Rocchio text filter. Since there is no initial query in text categorization, the first term of Rocchio's formula is set to zero. We use the centroid of the positive examples of the category to build the category zone. The following steps summarize our algorithm for a given category:

1. Create the centroid vector using Rocchio's formula: as a preprocessing step, discard stop words, stem the rest, and weight them using the atn weighting scheme (see note 1). Select the top-ranked 100 terms.
2. Build the category zone: use the centroid as the initial query and retrieve the top 10,000 documents from the training set. Create the category zone by adding any unretrieved relevant documents.
3. Using the category zone, build a classifier with Rocchio's formula and α=0, β=8, γ=8. Observe that this creates a vector that is the difference between the centroid of the positive examples and the centroid of the negative examples in the query zone. Select the top 100 terms with their respective weights. This is the classifier vector.
4. Using the classifier vector obtained in the previous step, rank the full training collection according to similarity with the classifier vector. A threshold (τ) on the similarity value that maximizes the F1 measure (described in the evaluation section) is selected. This is our optimal Rocchio classifier.
5. Evaluate the classifier on the test set: use the Rocchio classifier to rank the test collection and assign the category to all documents above the threshold τ.
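The core of step 3, the classifier vector as a clipped difference of positive and negative centroids, can be sketched as follows (our own illustration using sparse dict vectors; the actual system uses SMART, and the helper names are hypothetical):

```python
from collections import defaultdict

def rocchio_classifier(pos_docs, neg_docs, beta=8.0, gamma=8.0, n_terms=100):
    """pos_docs/neg_docs: lists of {term: weight} vectors from the category zone.
    Returns the classifier vector with alpha = 0 (no initial query)."""
    weights = defaultdict(float)
    for d in pos_docs:                       # beta * centroid of positives
        for term, w in d.items():
            weights[term] += beta * w / len(pos_docs)
    for d in neg_docs:                       # minus gamma * centroid of negatives
        for term, w in d.items():
            weights[term] -= gamma * w / len(neg_docs)
    # Negative components are set to zero; keep the top-ranked terms.
    positive = {t: w for t, w in weights.items() if w > 0}
    top = sorted(positive.items(), key=lambda kv: kv[1], reverse=True)[:n_terms]
    return dict(top)

print(rocchio_classifier([{"heart": 1.0, "x": 0.2}], [{"x": 1.0}]))  # {'heart': 8.0}
```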
Hierarchical Classifier

The hierarchical classifier is represented in Figure 6. To train each expert network we first build its "category zone" using steps 1 and 2 of the Rocchio classifier (see note 2), as explained before. Feature selection is then applied to each category zone to extract the "best" set of features for each category; as described in section 4.2, we use the Correlation Coefficient for this. For each expert a backpropagation neural network is trained using the corresponding category zone and the selected set of features. Each gating network is also a backpropagation network, but its training subset is the combined category zones of its descendants in the hierarchy; feature selection is performed on this subset to obtain the best set of features for the corresponding gate. Experts and gates are trained independently using the following parameters: learning rate = 0.5, error tolerance = 0.01, maximum number of cycles = 100. Training takes about 30 minutes for an expert and 90 minutes for a gating network on an HP-700 workstation. Using 12 workstations and a dynamic scheduling program specifically designed for this task, we train the 71 experts (those with at least 20 positive examples) and the 15 gating networks in about 6 to 7 hours. We tried several values for the number of input features of the networks (5, 10, 25, 50, 100, and 150); the best result was obtained with 25 input features. As mentioned before, the backpropagation architecture for 25 inputs has 50 hidden units and a single output unit. The inputs are the tf×idf weights of each selected feature (term) in the document, where tf is the frequency of the term in the document and idf is the inverse document frequency. Once experts and gates have been trained individually, we assemble them according to the hierarchical structure.
Since the output of our networks is a real value between 0 and 1, we need to transform it to a binary decision. This step is called defuzzification. We do this by selecting a threshold that optimizes the F1 value for each category. We use the complete training set to select the optimal threshold. Since we are working with a modular hierarchical structure we have several choices to perform defuzzification. Our approach is to make a binary decision in each of the gates and then optimize the threshold on the experts using only those examples that reach the leaf nodes. Observe that computing the optimal threshold for binary decision in the gates and then the thresholds for the binary decision in the experts implies a multidimensional optimization problem. We decided to optimize the gates by grouping them into levels and finding the value of the threshold at each level that maximizes the average F1 value for all the experts. Each expert's threshold is optimized to maximize the F1 value of the examples in the training set that reach the expert.
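Per-expert defuzzification, choosing the score threshold that maximizes F1 over the training examples that reach the expert, can be sketched like this (our own illustration; function and variable names are hypothetical):

```python
# Hypothetical sketch of F1-optimal threshold selection for one expert.
# outputs: network scores in [0, 1]; labels: true binary assignments.

def best_threshold(outputs, labels, candidates=None):
    """Return (threshold, F1) maximizing F1 on the given examples."""
    if candidates is None:
        candidates = sorted(set(outputs))
    best_t, best_f1 = 0.0, -1.0
    for t in candidates:
        tp = sum(1 for o, y in zip(outputs, labels) if o >= t and y)
        fp = sum(1 for o, y in zip(outputs, labels) if o >= t and not y)
        fn = sum(1 for o, y in zip(outputs, labels) if o < t and y)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

print(best_threshold([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # (0.8, 1.0)
```

For gates the paper optimizes one shared threshold per level instead of one per node, which reduces the multidimensional search to a few one-dimensional ones.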
Once the optimal thresholds are set on the training set, the test set is processed. The values reported in Section 8 are the F1 values obtained on the test set.

Flat Neural Network Classifier

To assess the advantage gained by exploiting the hierarchical structure of the classification scheme, we built a flat neural network classifier. We first tried a single network that learns all 49 categories with at least 75 examples, but after a long training time only a few iterations over the training set were completed and the network's performance was extremely low. We then decided to build a flat modular classifier that combines the outputs of the 49 individual expert networks using a simple mixture model. In this mixture model the experts are trained independently and their outputs are combined to get a single output vector y(n). The defuzzification step is performed by optimizing the F1 value of each expert on the entire training set. The performance of this model is considerably higher than that of the single network; these are the values that we report in the next section for the flat neural network classifier.
Results and Analysis

Table 3 shows the performance of the hierarchical model against the Rocchio classifier and the flat neural networks for the 49 categories with at least 75 examples. The flat classifier performs significantly better than Rocchio's classifier (47.9% improvement in macro-averaged F1); in a class-by-class comparison the flat neural network beats Rocchio's classifier in 42 of the 49 categories. The addition of the hierarchical structure improves the results further (65.7% improvement with respect to the Rocchio classifier, and 12% with respect to the flat classifier); the hierarchical classifier performs better than the flat classifier in 41 of the 49 categories. Compared with other reported results, the HME model achieved the same macro-averaged F1 (0.50) as the exponentiated gradient (EG) algorithm reported by Lewis et al. (1996), but lower than that of the Widrow-Hoff algorithm (0.55) reported in the same work. We would like to compare our results with the generalized instance set method proposed by Lam and Ho (1998); however, they used a different subset for training and testing (they use year 1991, taking the first 33,478 documents for training and the last 16,738 documents for testing), so we would need to repeat all the experiments using the same split in order to do a fair comparison. A detailed analysis of the behavior of the HME with respect to the flat neural network shows two facts. The defuzzification threshold for the hierarchical classifier is less than or equal to the threshold for the flat classifier in 45 of the 49 categories. This is an expected result because the intermediate layers perform a prefiltering of "bad
candidate texts"; hence the experts receive a smaller number of examples. Since the optimization process sets these thresholds to maximize the F1 values on the training set, once the "bad matches" have been filtered the algorithm can set a lower threshold that increases the number of true positives without significantly increasing the number of false positives. Even if the threshold is not changed, we can expect higher performance if the hierarchy is good at filtering false positives. Table 4 shows the number of documents that pass each gate in the test set. The number of documents in the test collection is 50,216. The root node filters most of the documents, and only 8,792 reach the nodes connected to the root.

                     Rocchio       Flat neural network    Hierarchical neural network
                     (baseline)    (% of improvement)     (% of improvement)
Macro-averaged F1    0.3007        0.4449 (47.9%)         0.4984 (65.7%)
Variance             0.0194        0.0320                 0.0254
Median               0.2791        0.4642                 0.5167

Table 3. Comparison of average performance over the 49 categories for the three classifiers.

Level 1 (threshold=0.01)   Level 2 (threshold=0.005)       Level 3 (threshold=0.01)       # of doc >= threshold
(root) Heart Diseases                                                                      8792
                           1.1 Arrhythmia                                                  1978
                                                           1.1.1 Tachycardia                562
                           1.2 Endocarditis                                                 289
                           1.3 Heart Defects, Congenital                                   1238
                                                           1.3.2 Heart Septal Defects       188
                           1.4 Heart Valve Diseases                                        1130
                           1.5 Myocardial Diseases                                         3401
                           1.6 Myocardial Ischemia                                         7052
                                                           1.6.3 Coronary Diseases         5357
                                                           1.6.4 Myocardial Infarction     5893

Table 4. Number of documents that pass each gate in the test set.
Conclusions

This paper presents a machine learning method for text categorization that takes advantage of the hierarchical structure of the indexing vocabulary. The results indicate that using this hierarchical structure improves performance significantly. Our method is scalable to large test collections and vocabularies because it divides the problem into smaller tasks, allowing a modular approach. The results obtained by our hierarchical model are comparable with those reported previously in the literature, and significantly better than the classical Rocchio classifier and our flat neural network classifier.

Future Work

We plan to study this approach with other machine learning methods to verify whether the improvement obtained from the hierarchical model carries over to other categorization methods. We will also explore alternative methods for approximating the category zone. It would also be interesting to explore alternative methods, such as expectation maximization, for training the hierarchical classifier. Finally, an aspect that remains to be tested is the effectiveness of our method in classifying full-text documents. Given the effectiveness and speed of our classifiers, this could be an important tool for real-time indexing of web pages in the medical domain.
Notes

1. SMART notation for a weighting scheme consists of three letters that indicate the values for the term frequency, collection frequency, and normalization components, respectively. The first letter a is the augmented normalized term frequency (0.5 + 0.5*tf/maxtf), where tf is the raw term frequency; the second component t indicates that the inverse document frequency is used as the collection frequency factor; the third component n indicates that no normalization is performed.
2. We use the SMART system for this task.
References

Apte, C. and Damerau, F. (1994). Automated learning of decision rules for text categorization. ACM Transactions on Information Systems, 12(3), 233-251.
Buckley, C. and Salton, G. (1995). Optimization of relevance feedback weights. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 351-357). New York, NY: ACM Press.
Hersh, W., Buckley, C., Leone, T. J. and Hickam, D. (1994). OHSUMED: An interactive retrieval evaluation and new large test collection for research. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 192-201). New York, NY: ACM Press.
Humphrey, S. and Kapoor, A. (1988). The MedIndEx system: Research on interactive knowledge-based indexing of medical literature. Washington, D.C.: National Library of Medicine.
Joachims, T. (1997). Text categorization with support vector machines: Learning with many relevant features. Technical Report LS-8 Report 23, University of Dortmund.
Jordan, M. I. and Jacobs, R. A. (1993). Hierarchical mixtures of experts and the EM algorithm. Technical Report, Massachusetts Institute of Technology.
Koller, D. and Sahami, M. (1998). Hierarchically classifying documents using very few words. In ICML-1997: Proceedings of the 14th International Conference on Machine Learning (pp. 170-178). New York, NY: AAAI Press.
Lam, W. and Ho, C. Y. (1998). Using a generalized instance set for automatic text categorization. In Proceedings of the 21st International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 81-89). New York, NY: ACM Press.
Lewis, D. D. (1992). An evaluation of phrasal and clustered representations on a text categorization task. In Proceedings of the 15th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 37-50). New York, NY: ACM Press.
Lewis, D. and Ringuette, M. (1994). Comparison of two learning algorithms for text categorization. In Proceedings of the 3rd Annual Symposium on Document Analysis and Information Retrieval (SDAIR'94).
Lewis, D. D., Schapire, R. E., Callan, J. P., and Papka, R. (1996). Training algorithms for linear classifiers. In Proceedings of the 19th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 298-303). New York, NY: ACM Press.
Liu, H. and Motoda, H. (1998). Less is more. In H. Liu and H. Motoda (Eds.), Feature Extraction, Construction and Selection: A Data Mining Perspective (chapter 1, pp. 3-12). New York, NY: Kluwer Academic Publishers.
Luhn, H. P. (1958). The automatic creation of literature abstracts. IBM Journal of Research and Development, 2 (April), 159-165.
McCallum, A., Rosenfeld, R., Mitchell, T. and Ng, A. Y. (1998). Improving text classification by shrinkage in a hierarchy of classes. In Proceedings of the 15th International Conference on Machine Learning. New York, NY: AAAI.
Mitchell, T. (1998). Machine Learning. Boston, MA: McGraw-Hill.
Mladenić, D. (1998). Machine learning on non-homogeneous distributed text data. PhD thesis, University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, Slovenia.
Moulinier, I. (1997). Is learning bias an issue in text categorization? Technical Report, LaFORIA-LIP6, Université Paris VI.
Ng, H. T., Goh, W. B., and Low, K. L. (1997). Feature selection, perceptron learning, and a usability case study for text categorization. In Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 67-73). New York, NY: ACM Press.
Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3), 130-137.
Rocchio, J. J. (1971). Relevance feedback in information retrieval. In G. Salton (Ed.), The SMART Retrieval System: Experiments in Automatic Document Processing (pp. 313-323). Englewood Cliffs, NJ: Prentice Hall.
Ruiz, M. E. and Srinivasan, P. (1998). Automatic text categorization using neural networks. In E. Efthimiadis (Ed.), Advances in Classification Research, Volume 8 (pp. 59-72). Medford, NJ: Information Today, Inc.
Salton, G. and Buckley, C. (1990). Improving retrieval performance by relevance feedback. Journal of the American Society for Information Science, 41, 288-297.
Salton, G. and McGill, M. J. (1983). Introduction to Modern Information Retrieval. New York, NY: McGraw-Hill.
Schapire, R. E., Singer, Y. and Singhal, A. (1998). Boosting and Rocchio applied to text filtering. In Proceedings of the 21st International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 215-223). New York, NY: ACM Press.
Singhal, A., Buckley, C. and Mitra, M. (1996). Pivoted document length normalization. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 21-29). New York, NY: ACM Press.
Singhal, A., Mitra, M. and Buckley, C. (1997). Learning routing queries in a query zone. In Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 25-32). New York, NY: ACM Press.
van Rijsbergen, C. J. (1979). Information Retrieval. Second edition. London: Butterworths.
Weiner, E., Pedersen, J. O. and Weigend, A. S. (1995). A neural network approach to topic spotting. In Proceedings of SDAIR'95 (pp. 317-332).
Yang, J. and Honovar, V. (1998). Feature subset selection using a genetic algorithm. In H. Liu and H. Motoda (Eds.), Feature Extraction, Construction and Selection: A Data Mining Perspective (pp. 117-136). New York, NY: Kluwer Academic Publishers.
Yang, Y. (1999). An evaluation of statistical approaches to text categorization. Information Retrieval, 1(1), 69-90.
Local Practice and the Growth of Knowledge: Decisions in Subject Access to Digitized Images Judith Weedman School of Library and Information Science San José State University, Fullerton, California, USA
Abstract This paper reports a pilot study of image digitization projects and the subject access they provide. It examines the factors which lead to undertaking a project, decisions about what images will be digitized, and the use of standard vocabularies and locally created vocabularies. The study considers these developments as examples of the growth of professional knowledge.
Introduction

This paper explores the growth of a professional knowledge base in terms of the interaction between formal, published standards and the ways that professional tasks are carried out "on the ground," in specific institutions. It defines a "professional knowledge base" as consisting of both the facts and theory codified in texts, journal articles, and published standards of practice, and the day-to-day situated actions of professionals in the field. It takes the opportunity represented by the current period of rapid professional knowledge growth in the area of image digitization to explore the relationship between the two. Improvements in technologies of compression and storage, network bandwidth, and display resolution, coupled with falling costs and the availability of grant money, have resulted in a burst of image digitization projects. Many libraries, museums, and other organizations have responded by making portions of their collections available to their users in this way; it is estimated that there may currently be literally thousands of such projects underway in the United States. The technologies and labor are inexpensive enough to bring such projects within the range of many organizations with modest financial resources, but are also still costly enough that converting entire collections wholesale is not an option. Therefore, decisions have to be made about what portions of the collection are appropriate for digitization, what funding sources will be sought, and how collections will be maintained and enlarged over time. The research described here is a pilot project for a larger study of small-scale image digitization projects and their decisions about subject access as examples of growth in a professional knowledge base. The pilot project used questionnaires
Washington, D.C., 31 October 1999
Weedman
Proceedings of the 10th ASIS SIG/CR Classification Research Workshop
150
completed by 15 respondents and follow-up interviews with eight individuals. The central concern is to understand decisions made in these projects about what to make physically accessible digitally and how to make it intellectually accessible through subject representation, in terms of the relationship between standards and local practice. The collections represented here primarily contain surrogates (slides) of works of art or architecture. This is in part the result of the fact that the study is concerned with images, and a large number of image collections naturally represent the visual arts. It is also a reflection of the fact that most of the sample was drawn from attendees at the 1999 annual conference of the Visual Resources Association, and VRA members tend to come from museums and university art and architecture departments. There are two exceptions, however: an archives of historical photographs in the Southwest and an archives of slides documenting a feminist art organization. Because this is a pilot study, the research questions are primarily exploratory and descriptive:

1) What factors drive the decision to undertake a digitization project?
2) What factors drive decisions about what to digitize?
3) What subject access decisions are being made?
4) How are users brought into the process?
5) What are the formal and informal communication channels which provide for the growth of knowledge?
Literature review: background and context Growth of knowledge The concept of a professional knowledge base goes back to early sociological definitions of professions as unique occupations (Parsons, 1939; Greenwood, 1962; Goode, 1969) and has been carried into contemporary information science in such research as that by White and McCain (1998), which identifies clusters within disciplinary knowledge, and Sydney Pierce’s (1987) exploration of the knowledge structures of professional literatures. Such knowledge is inherently provisional. The summations of professional knowledge presented in books and journals are subject to argument (as in Ellsworth Mason’s famous challenge to library automation, “Great Gas Bubble Prick’d; or, Computers Revealed -- by a Gentleman of Quality” (1971)) and to evolution (as with AACR and AACR2). The knowledge embodied in the minds and practices of the profession’s members is also fluid. It is built over time in webs of ideas and experiences from professional education, the literature, conferences, vendors’ sales representatives, colleagues, and just trying things out -- and grows sometimes in
advance of and sometimes behind the knowledge captured in publications. Often this knowledge growth consists of casual “let’s try it this way” changes, taking place in on-the-fly insights and experiments; they may or may not ever be communicated to colleagues, and may or may not be channeled back into the public arena. Professional association conferences function midway between the public standard and local practice, sometimes bringing authorities in as session presenters and sometimes giving local experts the opportunity to tell “how we done it good” in a specific setting. The growth in a profession’s knowledge base also occurs as new entrants to the field are educated, increasing the number of individuals in whom knowledge is instantiated and thus its variety (Weedman, 1999). As Suchman (1987) argues, all knowledge is situated. The knowledge base of a profession is never a monolith, and never fully reified; it takes various forms in various locales, and grows at different rates and in different, not necessarily consistent, directions. Diffusion of an innovation is one form of knowledge growth. Rogers’s (1995) classic work on this subject identifies five characteristics of innovations which influence their adoption: (1) relative advantage, the degree to which the innovation is perceived as better than the idea it supersedes, in terms of economics, prestige, convenience, or other qualities, (2) compatibility, the extent to which the innovation is perceived as consistent with existing values, experiences, knowledge, and needs, (3) complexity, the perception of how difficult it is to understand or implement, (4) trialability, the extent to which it is easy to experiment with the innovation in small steps, and (5) observability, the extent to which individuals can see the implementation of others and the results. Related to the observability of an innovation is the structure of formal and informal relationships in which an individual or organization is located.
This social network structure (Wellman and Berkowitz, 1988; Wasserman and Faust, 1993; Rogers and Kincaid, 1981) determines the knowledge which can be brought into an organization; communication channels must exist along which the new idea or practice can travel. Boundary-spanning communication is a central feature of innovation, as it allows new ideas to enter a system. Boundary-spanning includes a librarian working with technicians from an information technology department within the parent organization, which involves crossing both organizational and professional knowledge boundaries, or a librarian in one institution discussing his imaging project with another librarian during a break between meetings. Although the boundary spanning and diffusion literature has concentrated on informal communication, formal communication channels such as professional conferences and journals are also critical in bringing innovations into an individual’s awareness (Weedman, 1992). External funding sources are another formal communication channel; sometimes a grant announcement is the first encounter with the possibility of a particular type of project. Granting institutions often have priorities which can influence the direction a funded project takes. Two representative initiatives have been undertaken by the Library of Congress and the Getty Center. The Library of Congress’s National
Digital Libraries Project offers competitive funding for academic and public libraries, museums, and archives to create digital collections of primary resources which are significant for understanding United States history and culture. The Getty Center’s concern for preservation of cultural heritage information has led to such initiatives as L.A. Culture Net, a 3-year (1996-1999) pilot project to “organize online cultural resources in Los Angeles for people of all ages and walks of life.” These grantmaking institutions may provide training, may disseminate information, and may bring together participants to exchange insights and experiences. Professional standards As Bowker and Star (1997) state, decisions about subject access determine “what will be visible within a system (and of course what will thus be invisible).” Shatford (1986) wrote thirteen years ago that there was a need for determining “which … subjects are most important, which ones we should index first, which ones we should index secondarily, and which ones should not be indexed at all” (p. 54). These are issues with which designers of digitized collections must struggle. What constitutes “subject” is the first set of issues which must be addressed. Discussions frequently refer to the work of Sara Shatford (1986) and Karen Markey (1986), both of whom use Erwin Panofsky’s three levels of meaning in art as a foundation. At the first level, the pre-iconographic level, subject is considered in terms of the generic description of the objects and items represented -- a woman and children, for instance. The subject focuses on what is represented, what it is a representation of. It is also possible to consider what the representation is about: Dorothea Lange’s Migrant Mother is a picture of a woman and children, which might be about strength or determination. Shatford uses the term “mood” for the pre-iconographic analysis of about.
The second, iconographic, level is a cultural analysis of subject, recognition that a man in a painting is Moses (of) and that it is about escape from slavery; Shatford refers to symbolic meanings and abstract concepts that are communicated by images in the picture. The third level is “iconology,” which involves interpretation; iconology is a synthesis of the other two levels with the artistic, social, and cultural context. The three levels can also be labeled description, analysis, and interpretation. (These are slippery distinctions which this paper will not attempt to resolve.) Shatford suggests that the first two levels can be indexed, but perhaps or probably not the third, at least with any consistency. O’Connor and O’Connor’s recent work (1999) on users’ descriptions of images supports Shatford’s position; they found that users’ interpretations of emotion were often directly opposed, with one person describing an image as “lovely” while another used the term “depressing.” The Visual Resources Association Core Categories for Visual Resources (Visual Resources Association, 1997) uses three of these five levels -- (1) the objective description of what is depicted (e.g., a man in uniform), (2) the identification of the subject (e.g., George Washington), and (3) the deeper meaning or interpretation (“Washington stands in classical pose and leans upon a bundle of rods that signify the
Roman Magistrate -- thus associating Washington with great and powerful Roman magistrates of antiquity”). It is also necessary to distinguish clearly between the original work of art and a visual document which reproduces the work or some part of the work. Thus the subject of the painting Mona Lisa is a woman, but the subject of the slide of the Mona Lisa might be said to be the work of art itself. The VRA’s core categories clearly separate the categories pertaining to the work from the categories pertaining to the visual document. The distinction is not, however, as clear-cut as it sounds; even participants in the VRA’s Vision Project had difficulty addressing the separate categories with consistency (discussion, Lanzi, 1999). The existing standards for subject or content access are not stable. The definition of what the core elements for description should be is still being refined, and there is no single vocabulary providing data values for those elements which would compare to the Library of Congress Subject Headings or Sears in its ability to provide uniform subject access to visual resources. There are several vocabularies in existence. The Art and Architecture Thesaurus (Petersen, 1990) is an indexing vocabulary initially developed for textual materials about physical objects and images, but it is increasingly used to manage collections of the objects and images themselves (Rasmussen, 1997). It covers antiquity to the present, containing 120,000 terms. The AAT is faceted, covering facets such as Physical Attributes, Styles, Materials, and Objects. It provides vocabulary “for all the characteristics of art works … except for subjects” (Layne, 1994, p. 32). There are two primary vocabularies which do address subject. One is the Library of Congress’s Thesaurus for Graphic Materials 1 (Library of Congress, 1995), which contains 6,204 postable terms with an additional 4,324 entry vocabulary terms.
Although it does not include art historical and iconographic concepts, TGM1 “does supply terms for abstract ideas represented in certain types of images.” The introductory matter describes the difference between what an art work is of (what it depicts) and what it is about (the underlying intent or theme) and states that “subject cataloging must take into account both of these aspects if it is to satisfy as many search queries as possible.” ICONCLASS was developed specifically for iconography as an area within art history (Rasmussen, 1997); it contains 24,000 “definitions of objects, persons, events, situations, and abstract ideas” which reflect “themes and subjects in works of art” (ICONCLASS Research and Development Group, 1999). Because people and places can be subjects of works of art, the Getty’s Union List of Artists’ Names (Getty Research Institute, 1999a) and Thesaurus of Geographic Names (Getty Research Institute, 1999b) are also important subject vocabularies, along with the Library of Congress’s Name Authorities.
The Visual Resources Association’s standards recommend all of these vocabularies: AAT and LCSH for subject, ICONCLASS and TGM1 for iconographic themes; LC Name Authorities and ULAN for persons or groups, and TGN and LCSH for geographic places. It’s clear from the sheer number of vocabularies that none have truly become the professional standard. Markey (1988) described the proliferation of separate local databases ten years ago, and noted then that the work of LC in developing cataloging rules, the adaptation of the MARC record for graphic materials, and the availability of the Art and Architecture Thesaurus might foster consistency among collections. No recent research has explored whether there has been any increase in consistency in the last decade, nor whether the architecture of the World Wide Web has resulted in more standardization, though clearly the growth of collections on the Web has given new impetus to those organizations developing the standards. Rasmussen (1997) notes as well that many of the collections and vocabularies are being built by people outside the field of Library and Information Science. Professional values are reflected in approaches to subject access; Jain (1997, p. 32) notes that “in this interdisciplinary field, researchers have generally focused on issues related to their own disciplines.” Librarians have a long tradition of organization of information. The museum community, by contrast, has traditionally operated on principles of rugged individualism, and has only recently begun work on standards; museums tend to prioritize local and collection management needs over standardization. People with backgrounds in technology management often approach solutions to problems of organization technologically (Chang, Smith, and Meng, 1997; Huang, Mehrota, and Ramchandran, 1997). There is a variety of knowledge bases and professional value systems informing contemporary image digitization projects.
At the same time that grassroots digitization projects are springing up everywhere, and vocabularies and ontologies (as they are known in Computer Science) are proliferating, professional organizations are seeking to create metadata standards that will unify and provide for future usability and compatibility of digital visual resource collections. Four representative efforts are the Museum Educational Site Licensing Project (Trant, 1997; MESL, 1996), the CNI/OCLC Image Metadata Workshop (Weibel and Miller, 1997), the Vision Project of the Visual Resources Association and the Research Libraries Group (The Vision Project, n.d.), and the MPEG (MPEG-2, 1996) standards for visual content description of moving images. It is not clear yet what impact these standards have had on practice in the field.
Method The intent of this pilot project was to increase my knowledge of some of the environments in which digitization projects are being carried out and to explore research questions which seemed potentially useful. Fifteen individuals formed a convenience sample. One was known to me as having undertaken digitization projects, although I knew little about her work. Fourteen were attendees at the 1999 annual conference of the Visual Resources Association; I put out copies of a one-page questionnaire on the display table in the conference registration area, with a request that individuals engaged in or considering digitization projects complete it and return it to me. Semi-structured interviews lasting one to two hours were conducted by telephone with seven of the questionnaire respondents, and a two-hour interview was conducted on site with one individual. The sample is entirely inappropriate for any inferential statistics, but the variation within the sample suggests that useful insights may be gained. Miles and Huberman (1994) note that to get to the conceptual construct in a qualitative research project, a sample needs to “provide different instances of it, in different places, with different people. The prime concern is with the conditions under which the construct or theory operates” (p. 29). The respondents included seven university art collections, two university architecture collections, one university humanities collection, one learning/teaching support unit, three museums, and one archives. They came from 10 states and 2 countries. The goals of the projects were varied. Some projects made the images available only for instructional use -- scanned at a high enough resolution that they could be projected during lectures or at a workstation in an individual tutorial.
Other projects used the images solely for intellectual access to the original; thumbnails were linked to the database to provide additional information for the patron seeking a particular image or kind of image. Some projects did both in tandem. The respondents provided a rich introduction to the array of solutions which are evolving in local practice to the problems of subject access to images.
The respondents and their projects Ten of the 15 respondents were slide librarians or curators in universities, seven in schools or departments of art, two in colleges of architecture, and one in a unit providing technological support for learning and teaching. One of the respondents was the librarian for a college of art and design. Three were in museums of art, and one was in a state records center and archives. Table 1 provides summary information for the interviewees. In this table, and throughout the rest of the text and tables, projects are identified by a number followed by a U or M (to indicate a
university or museum environment), and art or arch to indicate whether the collection subject matter was primarily art or architecture.

Project | Size of total collection | Size of digitized collection | Project description | Images to be digitized
1-U-art | 430,000 | 3 | MARC records for images in university OPAC; links to images | entire collection, eventually
2-M-art | 150,000 | 2,500 | database/online catalog with linked images | entire collection, eventually
3-U-art | | | images on CD-ROMs, eventual link to database/catalog | slides used in undergraduate and graduate courses
4-U-arch | 95,000 | | 1: instructional package of PowerPoint slides; 2: imagebase to be used in resource sharing project | 1: images for review for exams (undergraduate); 2: donated collection of landscape images
5-M-art | 11,750 | | pilot project: digitize images and link to MARC records | images from Registrar’s collection (not the main slide collection)
6-U-art | 100,000 | 1: 800 (200 per syllabus); 2: 300 | 1: 4 syllabi put up on web, images for each unit linked from syllabus; 2: imagebase | 1: most important images for each period, medium covered in course; 2: photography collection
7-U-arch | 300,000 | 2,000 | pilot project: imagebase | slides used in survey course
8-U-art | | 1,500 | web database, one of several regional databases available through the site | subset of a collection dealing with women’s culture
Table 1. The Projects

The job titles of respondents included: archivist, photoarchivist, center manager, project manager, digital project coordinator, humanities curator, slide curator (two individuals), visual resources librarian (two), and visual resources curator (three). For convenience, the most common term, curator, will usually be used when referring to the respondents. Years in this particular job ranged from one to thirty, with a mode of 7 and a mean of 7.5. All had degrees in art or architecture. Four of the eight had MLIS degrees, and each had a second masters in arts administration or art history; one had a masters degree in museum studies. One of the curators had a masters degree in art history, one a degree in architecture, and one a Ph.D. in art history. The interviews were conducted with five of the individuals in universities, the librarian in the art college, and two of the museum slide librarians. Both of the museums had formal relationships with nearby colleges or universities and functioned as instructional collections in addition to their role serving the general public.
The vast majority of images in the collections were slides, but there were also 4x5” transparencies and photographs. Art and architecture slides come from three sources. The first source is commercial packages, often produced to accompany courses. The second source is copy work from books -- slides created by photographing illustrations. The third source is original photography -- these may include slides of buildings shot by architects, photographs of events, collections of the work of artists whose medium is photography, photographs of faculty or student art (paintings, sculpture, performance art, etc.), and various other images. The proportions of slides from each source are different in each collection. The projects ranged from comprehensive plans to digitize the entire collection to small, experimental beginnings. The size of the collection to be digitized for the first project undertaken (some of the librarians had two separate projects underway) ranged from 200 to 430,000. The largest number of images actually digitized and in use was 2,000. To provide a sense of scale, in a university art department, a collection of 50,000 slides is considered a small collection. Decisions to digitize and decisions to provide subject access were loosely coupled. The decision to create an electronic database as a catalog to a collection could precede a decision to digitize by several years, could occur simultaneously, or could be planned as a future application for the digitized images. Physical access and intellectual access were not intrinsically related, although in fact they sometimes were undertaken together. Thirteen of the fifteen questionnaire respondents were providing subject access in some form to their digital images; all eight of the individuals interviewed were doing so. Of the eight interviewees, two began with images specifically for lectures and study and later developed projects to link images to a database. 
In the remainder of this paper, the discussion is based on the eight interviews completed, except where the fifteen questionnaire respondents are explicitly referred to.
What drives the decision to undertake a digitization project? Underlying all the discussions I conducted with practitioners was the sense that digital storage of image collections will eventually become as widespread as digital storage of documents. It is a technology which will change their collections, and change the ways in which the collections are used. Some of the respondents had a
sense of urgency about the technology; one described talking with other slide library curators in the area and realizing that “a lot of them were doing digitization. I realized how far behind we were.” The important issue was not the specific uses of the technology, but a more general sense that the profession is moving, changing, and that if one is part of the profession (or, at least, a respectable part of it), one grows with the profession. Another, more concrete, version of this feeling was expressed by the respondent who said “Eventually enough collections will be digitized, enough rooms will be wired, and enough people will expect digital images that we will reach critical mass and I would rather be a part of the mass than have it fall on me.” The interviews conducted with eight of the librarians examined the immediate impetus for undertaking a project, other enabling factors, and the goals of the project. In two cases, the availability of grant money provided the immediate impetus for the project. In one of these situations, the grant-making agency knew the school had a collection of cultural and historical interest, and sought the librarian out. In the other, another member of the academic community had received funds for a digitization project and his initial plan fell through, so he went looking for something to digitize. The librarians in this case were less than enthusiastic about having their time and resources drawn into someone else’s enthusiasms, but their administrator gave them little choice. In two cases, faculty were digitizing images for use in their classes, with no planning for bibliographic control, accessibility, or future needs. The librarians saw this as an opportunity to begin a project and impose some rationality and consistency on choices about organization, resolution, etc. Librarians went to the faculty to introduce the idea of using digitized images in two of the cases. 
In two cases, one in a university and one in a museum, the digitization project began with a comprehensive plan to make the entire collection available through an image database. In the university, the proximate cause was the incorporation of the departmental slide collection into the university library, with the requirement that access be provided through the OPAC; the incorporation coincided with the retirement of the slide librarian, so the new librarian was hired with the administration of the project in mind. In the museum, the librarian began with the textual data, getting the records into a database, and was waiting for the right software to add images. When she found a program that seemed suitable, and discovered the company would work with her in designing the data structure, that provided the opportunity to begin scanning. The compelling motivation for one of the curators was to create visibility for the library. “In a small college the faculty don’t see librarians as colleagues, or as part of the educational process. So I’ve tried to change the perception the whole time I’ve
been there. I needed to have more visibility for the library, and I thought the only way to get the visibility I want is to grab their attention in some kind of digital way. Over time, that could lead to increases in the budget and I would be able to create more resources.” The sense of the inevitability of this particular technology surfaced in all the interviews. Accompanying it was a feeling that the respondent didn’t know enough yet to start but that the time had come to jump in anyway. One said that her “biggest fear is that we will do something that we will wish we had done differently later.”
What factors drive decisions about what to digitize? Librarians and curators are in a situation where, although individual components of the technology are quite familiar, its use for providing access to an image collection, and the implications of that use, are not. There is uncertainty about the best instructional uses, the best way to link images to the catalog, uses permitted by copyright law, how to plan now for future needs, what resolutions should be provided, and more. All of the interviewees started small, even those who planned from the beginning to reproduce the entire collection. As one said, “I’d been wanting to do some kind of digital work. I wanted to do an image database for the whole collection -- but I don’t have an assistant curator -- so I decided it was best just to pick something and jump in.” One of the respondents said: “Nobody has said, ‘Do this, and then this, and then this; this is how you do a project.’ I feel like I’m in this position where I have to create it myself. So the only way I can even conceive to create these things is small little steps. Where you do a little bit and you take another step and you see what the interest is and then you take on a little bit bigger project and you just keep building on it. That’s the only way I can think of to learn it, because I can’t afford to take a semester and go to some school or whatever. And nobody that I know around here is really teaching it that much.” Often the logical place to start is with an instructional package for a course rather than with an image base. This has the advantages of defining a discrete part of the collection and of having an established structure, a clearly definable objective, and a single individual to work with. One librarian began with a course syllabus which she put up on the web, with images linked for each unit. Another began with review tutorials, using text and images in kiosks which already ran PowerPoint slides for departmental announcements.
A third began with a lecture that one of the faculty members had given on the decorative element in contemporary artists’ work which included the work of many of the faculty; she got permissions from the faculty members and digitized the slides of their work, to put on a CD along with the lecture and some introductory text.
Others, who had planned from the beginning to make the entire collection available, started with as few as three or ten images. They were able to experiment with the software, see the effect of various resolutions, get a feel for all the pieces before putting them together. How were these first three or ten selected? A typical answer was provided by a museum librarian, who said: “Simple. The next exhibition.” Another factor that weighed heavily in the decision was copyright law, or the librarian’s understanding of copyright law. Who owned the rights to particular images and how easy or difficult it was to contact and obtain the permissions for reproduction affected the choice of images. One librarian started with a collection that was a donation to the university, because the library owned the rights. Another started with commercial images because the rights-holders were publishers and therefore easy to identify and locate (not always the case with original photography). A third was starting with copy work, because she felt it fell into a gray area in copyright law, whereas she knew she didn’t have the rights to commercial slides and the faculty who produced the original works had in many cases retired or moved on. Copyright affected both what images to digitize, and what resolution to use.
What subject access decisions are being made? This discussion will focus on the database and catalog projects, and omit those projects which were purely instructional in nature. See Table 2 for a summary of the data. Visual resource collections have traditionally been arranged physically, with no subject access (Small, 1991). Many materials in collections -- for instance, crafts -- lack even titles. Slides are arranged geographically by nationality of creator or by location, or chronologically, and then by the creator. The slide label is often the full record for that slide; therefore, the information is brief. To enable information to be recorded, abbreviation authority lists have been created by the Visual Resources Association (Schuller, 1989) and coding systems have been developed. The Smith-Tansey system for architecture, for instance, uses a coded square of numbers and letters to represent attributes including time period, country, art form, style, artist’s name, medium, title, subject, part of building, and view. It is used for physical arrangement, and is, according to one of the respondents, indecipherable to the users of the collection; it does not allow searching by actual subject. The introduction of commercial database software, and the refinement of its functionality to allow printing of selected fields on slide labels, made possible the addition of subjects to the searchable information about collections. Six of the eight curators had built databases containing textual records for at least a portion of their collections before beginning digitization of the images themselves; for the other two, the ability to link images to the data provided the impetus to create a database.
Washington, D.C., 31 October 1999
Weedman
Proceedings of the 10th ASIG SIG/CR Classification Research Workshop
Three of the projects -- two in university art departments and one in a museum of art -- were comprehensive; the intent was to create a database for the entire collection, at least from that point forward. The other five projects were limited in scope, focused on a single discrete collection within the collection as a whole. Three of the five were experimental, intended to allow the curator to explore the potential, the best approaches, and the technologies themselves, without committing the library to a comprehensive plan for the future. The other two projects were the result of unexpected grant money, and the scope of the project was tailored to the life of the grant. The three experimental projects were all preceded by robust instructional projects; there was a sense that these projects created an opportunity for linking those images to a database that was too important to pass up, though there were only limited resources available for doing so. One of the respondents said that discussions of digitizing additional images and connecting them to the database all began with the phrase “when we get the money…”. Another respondent lamented, “It’s a huge project -- there’s a lot of data entry -- and we have no real help. The students are doing slide accessioning, labeling, etc. It’s hard to find a big enough chunk of labor; hard to make any headway. I haven’t actually worked on this in the last ten months or so.” It is most useful to consider each of the projects in turn. The aspects of subject access to be discussed are the vocabularies used, the number of terms assigned, cataloging of the entire image versus details (individual objects within the image), and responsibility for term assignment. At the university which was integrating its visual resources collection into the library’s OPAC (1-U-art), the data structure was of course the MARC record.
The librarian was hired specifically to implement the changeover from a departmental, unautomated collection to representation in the university’s online catalog. The expectation is that permissions will ultimately be sought for the entire collection, and all the images for which permissions are obtained will be linked to their OPAC records. This collection uses AAT and LCSH, in different fields. The 6xx fields are the subject access fields; LCSH is used in these. The curator initially assigned up to ten terms per image, but the under-staffed technical services department imposed a six-term limit. The average number of subject headings per item is four. In general, these terms address the image as a whole. AAT is used in the notes (5xx) fields; the terms from the thesaurus are incorporated into one or two, often very long, sentences. This field captures the specifics of the image, and the terms are selected on the basis of anticipated uses and queries. A typical sentence in the notes field might read “Mary is weaving on a tablet loom; view of Gothic interior arch, leaded glass windows; Joseph is sitting on stool with weaving tool.” The curator was not able to estimate how many AAT terms might be incorporated into such a sentence, but said that the sentences “can get pretty long.”
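The 1-U-art convention described above -- discrete, capped LCSH headings in the 6xx subject fields, with AAT terms woven into a free-text 5xx note -- can be sketched as follows. This is a hypothetical illustration, not the site's actual system; the field tags follow common MARC usage, but the function and data are invented for the example.

```python
# Sketch of the 1-U-art record convention (hypothetical code, not the
# library's actual implementation): LCSH headings go in repeatable 6xx
# subject fields, capped at six per image by technical services, while
# AAT terms are embedded in a free-text 5xx note sentence.

MAX_SUBJECT_HEADINGS = 6  # limit imposed by the technical services department

def build_record(title, lcsh_headings, aat_note_terms):
    """Assemble a minimal MARC-like dict for one slide image."""
    return {
        "245": title,                                 # title statement
        "650": lcsh_headings[:MAX_SUBJECT_HEADINGS],  # topical headings (LCSH)
        # AAT terms are incorporated into a descriptive note rather than
        # discrete headings; keyword search still reaches them.
        "520": "; ".join(aat_note_terms) + ".",
    }

rec = build_record(
    "Mary weaving, Gothic interior",
    ["Weaving", "Looms", "Interiors, Gothic", "Windows", "Saints",
     "Textile crafts", "Stools"],                     # seventh heading dropped
    ["tablet looms", "leaded glass windows", "weaving tools"],
)
print(len(rec["650"]))  # → 6, the cap
print(rec["520"])       # → tablet looms; leaded glass windows; weaving tools.
```

The trade-off the curator describes falls out directly: the 6xx list is controlled and bounded, while the 5xx note can "get pretty long" because it is ordinary prose.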
Table 2. Subject access

Project | Database/digitization | Fields | Vocabularies used | Example of in-house terms (where available) | Indiv. objects or details | No. of terms typically assigned | Subject terms assigned by
1-U-art | Comprehensive database and digitization in tandem | MARC: 5xx, 6xx | 5xx: AAT; 6xx: LCSH, supplemented w/ in-house | -- | yes | 5xx: 5-12; 6xx: 4 | respondent
2-M-art | Comprehensive database preceded digitization | 1) subject 2) AAT 3) ICONCLASS | 1) in-house 2) AAT 3) ICONCLASS | Prisons -- Chain gang; Costume -- Prisoner; Agriculture -- Farming -- U.S.; Cartography -- Aerial; Buddhism -- Symbols -- Wheel of Law; Allegory | yes | -- | respondent
3-U-art | Comprehensive database preceded digitization | hierarchical subject field | AAT, ICONCLASS, LCSH, in-house to supplement | -- | yes | “as many as necessary to get all the important things in” | cataloging cooperative
4-U-arch | Comprehensive database preceded digitization; discrete digitization project has own database | description | AAT, in-house to supplement | -- | yes | 2-4 | students
5-M-art | Discrete collection mgmt system preceded; new database in tandem with digitization | MARC: 6xx | 1) work type (in-house) 2) AAT, LCSH | African architecture; American fiber art; American furniture; European metal work; Classical decorative art; European print; Asian painting | -- | not decided | respondent
6-U-art | Comprehensive database preceded digitization | 1) subject 2) keyword | 1) in-house categories from physical arrangement 2) in-house | Landscape; Nonobjective; Animal; Ballet; Narrative with figure | yes | 1) 1 2) multiple | respondent
7-U-arch | Discrete database and digitization in tandem | building type; special features | in-house, drawn from architectural dictionaries & reference books | House, country; Tree, oak | yes | 15-20 | students
8-U-art | Database and digitization in tandem | keyword | in-house | Women’s culture; Events; Performance; Feminist studio; Workshop; Public art | no | 2-6 (estimate) | respondent
The curator found AAT to be “very ample,” and a good fit to the collection. Her expectation had been that both vocabularies would be unacceptably weak for non-Western art, and would not be specific enough for item-level cataloging; in practice, she found that she rarely needed to supplement the terms. When she does supplement them, it is for something “very obscure,” and she uses terms from subject specialty dictionaries. The curator doing the other large and comprehensive project, at an art museum, had been providing subject access to the collection for three decades, and implemented an electronic database several years ago. She had a very different opinion of LCSH and AAT. She finds LCSH to be unusably broad, AAT to contain descriptive terms rather than true subject terms, and both, along with ICONCLASS, to be ethnocentric and biased regarding non-Western art. For a collection which is one-third non-Western, this is a significant problem. However, the curator has provided fields for AAT and ICONCLASS in the data structure. The primary subject field contains subject headings which she developed for the collection, which are loosely patterned on LCSH but more specific. Where LC provides the term “architecture, domestic,” the in-house vocabulary gives building type, building style, and location -- for instance, “Architecture - Residence - Georgian - Virginia.” ICONCLASS does address aboutness, and the curator expects that it may eventually become much more widely used than it is currently. However, she feels that the coding structure makes it extremely difficult to use, and for the moment she is not adding ICONCLASS terms to the records. Since the in-house vocabulary provides the primary subject access, the AAT vocabulary is used without modification. Term assignment is done by the curator. The other comprehensive project, 3-U-art, has its cataloging done by a cooperative, and the VRA core categories form the basis for the data structure.
The subject field is hierarchical, and they select terms as needed from AAT, ICONCLASS, and LCSH; when the needed term is not in any of those three, they will add their own. The terms they find they need to add are the most specific terms -- for the first two levels in the hierarchy, appropriate terms are available. 4-U-arch, which is doing simultaneous database and imagebase projects on different computer platforms, is using AAT for the description field, embedded in one or two short phrases or sentences. The curator was unable to estimate the average number of terms used per record. She said that they do occasionally supplement the vocabulary, but wasn’t aware that the terms they need to add formed a pattern or tended to be of any particular nature. Students do the cataloging, and it is checked by the curator. As noted earlier, the digitized slides are being handled separately from the others; a truncated version of the primary database is in use, which contains only the information that can be printed on the slide, not the description field or additional access points, plus links to the images on a separate CD-ROM.
5-M-art, the pilot project digitizing the discrete collection of the Registrar’s transparencies and slides, is also using MARC as a data structure. They are still in the process of working out subject access. In the 6xx fields, they will use an in-house vocabulary based on the pre-existing physical slide arrangement system. There are 26 terms used to represent work type -- architecture, fiber arts, furniture, photography, sculpture, and painting are representative examples. The subject headings will combine work type with department (American, Asian, European, Classics), followed by the creator’s name and date. The cataloging will be done by the respondent. An additional field will be created for LCSH subject headings. They are also considering adding descriptions to the notes field -- perhaps three sentences or a short paragraph describing the literal content of the image (no interpretation), such as “young girl riding side saddle, hair in pony tail…” One issue in a museum is the divided nature of the responsibilities; the Registrar’s office manages the collection, provides the database of holdings, keeps records of the use of each object in exhibitions and loans, and does the actual hanging, while the curatorial staff plans the exhibits and deals with intellectual content. Decisions about notes and descriptions may overlap with curatorial responsibilities, requiring negotiation and agreement. The museum is considering the use of AAT terms, which have the advantages of being intended for such collections and of being used in the pre-existing collection management database; but concerns about conflict between AAT and LCSH terminology may preclude use of AAT. 6-U-art has two fields for subject access, referred to as subject and key word fields. The subject terms are drawn from the physical arrangement, which is first by work type (architecture, sculpture, painting), followed by a facet for country, then century, artist, and subject.
“Subject” refers to the nature of what is depicted: landscape, animal, ballet, narrative with figure, nonobjective, etc. There is an authority list kept of terms assigned in the key word field. The image is analyzed as a whole. In 7-U-arch, there are two subject fields, one for building type and one for special features. The special features field can contain 15-20 words, drawn from various architectural dictionaries as needed. However, the terms are assigned by architecture students, who don’t have a great deal of architectural knowledge yet, and often don’t use the dictionaries and reference books. The records are checked by the curator, but this is time-consuming and often cannot be done as thoroughly as she would like. The curator would use AAT extensively if the library had a paper copy; they have found AAT cumbersome to use online, and Netscape crashes whenever they access the site. For 8-U-art, subject access is provided through key words. These were composed on the fly by the curator. They are designed to address the fact that the web search engine allows users to search this database simultaneously with other databases from a wide range of subject areas. “Women’s Culture,” for instance, is a key word for every slide, so that the images can be found among images of cowboys, aqueducts, historical landmarks, and neighborhoods. The user group is broad and unpredictable,
making a general vocabulary more useful than a highly specialized feminist, historical, or art vocabulary. A list of terms was not maintained by the curator.
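The 5-M-art heading pattern described earlier -- work type combined with curatorial department, followed by the creator's name and date -- can be sketched as below. This is a hypothetical illustration: the separator, ordering, and validation are assumptions, not the museum's actual syntax, and only six of its 26 work types are shown.

```python
# Hypothetical sketch of the 5-M-art subject heading pattern: work type
# plus department (American, Asian, European, Classics), then creator and
# date. The " -- " separator and the ordering are assumptions.

WORK_TYPES = {"architecture", "fiber arts", "furniture", "photography",
              "sculpture", "painting"}   # 6 of the 26 in-house work types
DEPARTMENTS = {"American", "Asian", "European", "Classics"}

def make_heading(work_type, department, creator, date):
    """Build one controlled subject heading string for a slide record."""
    if work_type not in WORK_TYPES:
        raise ValueError(f"unknown work type: {work_type}")
    if department not in DEPARTMENTS:
        raise ValueError(f"unknown department: {department}")
    return " -- ".join([work_type.capitalize(), department, creator, date])

print(make_heading("painting", "European", "Vermeer, Johannes", "1665"))
# → Painting -- European -- Vermeer, Johannes -- 1665
```

Validating against closed lists is what distinguishes this kind of in-house controlled vocabulary from the free keywords 8-U-art used, where no list of assigned terms was maintained.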
How are users brought into the process? Only one of the curators, 1-U-art, incorporated users directly into the planning process, and she did so because of the structure imposed by the university; a committee of 12 faculty members and the dean participated in setting policies for circulation, access, etc. The respondent found two advantages to working with the committee; it provided them with support in the Faculty Senate, and it gave them a strong base from which to publicize the collection. One of the other librarians planning a comprehensive catalog said that she would bring in users as soon as she had a web page design that she liked, and actively solicit opinions. In some cases, the faculty unknowingly played a role in the decision as to what to digitize, since it was the slides requested most frequently for course use -- the “monuments,” as one librarian called them -- that would be digitized first. Discussion of use of the collection focused on the teaching needs of faculty and students rather than on research needs. Several of the curators described professors coming in to pull the slides they always use for a particular course. Response to the innovation by users has been mixed. Students have been overwhelmingly enthusiastic, especially about tutorials which can be reviewed at individual workstations, rather than in lightboxes with groups of students huddled around trying to see. Faculty at the university which put their syllabi up and linked images to them were rapid converts to web technology; the librarian said that although at first they were very skeptical, “they’ve turned around so quickly I’m just shocked. They’re old world scholar types. Very wary. Now they come to me and ask for web pages. They were hesitant; now they’re very excited.” The slide collection added to the university library OPAC found a clear increase in usage by faculty and students from outside the art department. 
However, faculty are often reluctant to change their tried and true habits. There is also the physical appeal of the old way of doing things; “they love their slides and the photographs,” they know exactly where to find what they want, they enjoy the physical process of working with drawers of slides. In these cases, an imagebase goes largely unused. Another librarian began her work with digitization by putting faculty art up on a web page, hoping to generate enthusiasm and a sense of the potential of the medium, to “grab their attention in some kind of digital way.” She found, however, that “when it’s not part of a culture, it’s not the expectation … And I’ve put out newsletters, and spoken at the faculty assembly about what we’re doing, but I still don’t think they’ve really quite gotten it. They’re not beating down the door with their slides, and saying ‘Please include me’ yet. So we’re just trying to kind of gradually spread the word.”
What are the formal and informal communication channels which provide for the growth of knowledge? The sample of librarians and curators for this pilot project is skewed toward the more active members of the profession. Attendance at an out-of-state conference is in itself an indication that an individual has a certain level of professional involvement. The element of self-selection present, because the questionnaire was distributed passively, also has an effect on the results. Still, the individuals interviewed give an interesting look at the movement of ideas among members of a profession. Table 3 lists the communication channels cited. The individuals experimenting with digitization and linking images to textual representations found both formal and informal channels to be very useful. The Visual Resources Association was the association most frequently mentioned -- not surprisingly since all the respondents were originally contacted through the VRA. The formal presentations themselves were useful, and the ability to contact a presenter later and ask specific questions was equally, sometimes more, important. The Art Libraries Society of North America was also mentioned by 4 of the respondents. NINCH, the National Initiative for a Networked Cultural Heritage, and the Museum Computer Network were both mentioned -- though, interestingly, by one of the individuals in a university slide library, not a museum. Five of the respondents cited the VRA’s listserv as an important source of technical information, and a good place to obtain answers to specific questions. One respondent commented that she “could not have done the project without this listserv … it really is so important to a project like this when you’re all alone.”
Table 3. Communication channels important to digitization and subject access projects

Project (profl degree of curator) | Journals, articles | Books | Electronic | Conferences | Workshops | Colleagues
1-U-art (MLIS) | -- | Petersen and Barnett (1994), Guide to Indexing and Cataloging with the AAT; Besser and Trant (1995), Introduction to Imaging | VRA listserv | VRA | -- | nearby collection
2-M-art (art hist.) | -- | -- | -- | VRA; ARLIS | Getty’s on AAT, ULAN | librarians at her university
3-U-art (MLIS) | Shatford Layne articles in Cataloging & Classification Quarterly, JASIS, The Reference Librarian; Art Documentation; VRA Bulletin; Art Library Journal | McRae and White (1998), ArtMARC Sourcebook; Kenney and Chapman (1996), Digital Imaging for Libraries and Archives | -- | VRA; ARLIS | -- | --
4-U-arch (museum studies) | -- | -- | VRA listserv | -- | -- | --
5-M-art (MLIS) | Art Documentation | Petersen and Molholt (1990), Beyond the Book; McRae and White (1998), ArtMARC | RLG’s DigiNews | VRA (natl); ARLIS (local) | -- | --
6-U-art (art hist.) | Spectra (Museum Computer Network); VRA Bulletin | -- | NINCH website; VRA listserv | SECAC; follow-up from VRA conference | -- | --
7-U-arch (architect.) | -- | -- | VRA listserv | VRA | -- | colleagues at VRA; technical support people on campus
8-U-art (MLIS) | New Media; Communication Arts; graphic design journals; general news media | -- | VRA listserv; ARLIS listserv | -- | -- | imaging software developer; artists who’ve done CD-ROMs; grant participants

Acronyms: VRA = Visual Resources Association, ARLIS = Art Libraries Society of North America, NINCH = National Initiative for a Networked Cultural Heritage, SECAC = Southeast College Art Conference, RLG = Research Libraries Group.
The three individuals involved in more comprehensive projects found the associations and the listserv less useful; the majority of attention was focused on projects that were smaller in scope than their own. All three had contributed themselves to association programs or publications. Five of the respondents did a lot of reading at the beginnings of their projects. Five books were mentioned: ArtMARC Sourcebook (McRae and White, 1998), Beyond the Book (Petersen and Molholt, 1990), Guide to Indexing and Cataloging with the Art and Architecture Thesaurus (Petersen, 1994), Introduction to Imaging (Besser and Trant, 1995), and Digital Imaging for Libraries and Archives (Kenney and Chapman, 1996), by a group at Cornell who developed an imagebase. The journals published by the Visual Resources Association and ARLIS were mentioned by five of the respondents. As mentioned above, the three individuals involved in comprehensive projects were contributors to the national associations in various ways. They had made presentations at conferences, written journal articles and book chapters, and served on data standards or copyright committees. Two of the people doing smaller projects had presented at the national or local level as well. Two others were planning to do presentations when their projects were farther along. All the respondents also participated in informal exchanges about digitization, data structures, and copyright. There were visits to see what others were doing, conversations at conferences, queries posted to listservs, and interactions with individuals encountered in various ways. One respondent described systematically seeking people out, and asking them to suggest other people she should talk to. Again, it was the people doing smaller projects who found useful information through various informal channels. One respondent noted that the most useful people to talk with are those who “are maybe just a little further ahead than me, and not in a big institution.
I could maybe get, you know, the next level of what’s happening. If they’re too big, too far ahead, then they’re doing things that we might never do.” Both those with small projects and those with comprehensive ones were sought out by others who wanted to see what they were doing, and perhaps ask advice. Those doing the largest projects seemed to be the ones most often contacted; probably because their formal contributions to the profession brought them to people’s notice. Only one of the respondents found very little of value through either formal or informal communication channels.
Conclusion

How, then, is the professional knowledge base growing, as represented in these eight projects? Rogers’s five factors clearly played a role. Observability appears to have had a strong impact on the decision to undertake a digitization project. At the most general level, professional journals, association conferences, conversations with colleagues, and the varied uses of images on the Web have created a sense of inevitability about image digitization and the incorporation of images into databases and online catalogs. Some of the respondents even had difficulty identifying a source for their decision to do a project; as one curator said, “It’s everywhere.” More specifically, the willingness of local innovators to speak at local and national conferences has brought small- to medium-scale projects into visibility. The VRA listserv in particular has played an important role in allowing for the exchange of technical information between people at various stages of trial and implementation. Where subject access is concerned, formal channels seem to be more important than informal. Respondents reported many conversations about scanning, but relatively few about subject access. All of the respondents were at least aware of the Art and Architecture Thesaurus and considered it to be an obvious alternative to consider, though not all found it the obvious choice. LCSH, because of its widespread use in libraries, is also a must-consider, though not a must-choose. ICONCLASS and TGM1 fall far behind the other two vocabularies in being established in the professional knowledge base. One of the questions out of which this study arose concerned the boundary-spanning nature of professional knowledge growth. Half of the people I interviewed had professional degrees in Library and Information Studies. One had a degree in museum studies (anthropology), which included a course in collection management. The others had masters degrees or Ph.D.s in art history.
Visual resource collections in art or architecture are not traditionally staffed by librarians, nor are they usually a part of a university library system; normally they fall administratively within a department or school. Museum curators and Registry Department staff members are a different profession yet (and in fact these two groups themselves have very different professional roles). The Visual Resources Association is bringing many of these individuals into the same conversation, although each professional group has its own associations as well. The Getty has also contributed to the observability of innovation across professional boundaries through its development of the AAT. Trialability has worked in favor of rapid growth of knowledge in the area of technology for physical access, and somewhat against change in subject access. Because the equipment is relatively low cost, it’s possible to invest in a scanner and software and try out the technology on a clearly demarcated part of the collection. Pilot projects make sense; the resource investment doesn’t require a long-range commitment to total conversion for justification. Intellectual access, however, poses
different problems. Changing subject access is not something a curator is likely to do on a provisional basis. In fact, the prevalence of incremental approaches to digitizing which results from small grants, uncertainty about the best way to do things, and limited resources, may work against long-term, comprehensive planning for subject access. One way of resolving the tension between need for stability and need for change is provided by the flexibility of the field structure of an electronic database; the ability to create multiple subject fields allows the use of new and old vocabularies simultaneously, thus permitting trial of the new vocabulary without losing the known usefulness of the old. Four of the eight projects studied incorporated part or all of their old system into the database, and added fields for AAT, LCSH, or ICONCLASS. Of the four that did not, two made a wholesale change to standard vocabularies, one made a wholesale change to an in-house vocabulary created from specialized dictionaries and reference works, and one created hers as she added the images. Complexity is an important factor; imagebase projects involve copyright law (about which there are a variety of conflicting opinions), image manipulation techniques, large amounts of storage, time (one curator found it took 30 minutes per slide when time to size it and make corrections was included), data structures, and vocabularies. Many of these aspects are not yet a well established part of the professional knowledge base. Respondents were particularly uncertain about copyright and appropriate resolutions. The intellectual complexity of subject access has clearly been a barrier to the emergence of a professional standard. AAT has the highest level of adoption, with use by five of the eight sites. However, only one uses it alone. LCSH is used by three collections, always in conjunction with other vocabularies. 
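The parallel-field strategy described above -- separate subject fields so that new and old vocabularies coexist on the same record -- can be sketched as follows. The field names and data are illustrative inventions, not drawn from any respondent's actual database.

```python
# Minimal sketch of the parallel-field strategy: each vocabulary gets its
# own subject field, so a new vocabulary can be trialed without discarding
# the old one. Field names and values are hypothetical.

record = {
    "image_id": "slide-00421",
    "subject_inhouse": ["Architecture - Residence - Georgian - Virginia"],
    "subject_aat": ["dwellings", "porticoes"],
    "subject_lcsh": ["Architecture, Domestic"],
}

def search_subjects(rec, term):
    """Match a query term against every subject field, old or new."""
    term = term.lower()
    return any(
        term in value.lower()
        for field, values in rec.items()
        if field.startswith("subject_")   # skip non-subject fields
        for value in values
    )

print(search_subjects(record, "georgian"))   # → True, via the in-house field
print(search_subjects(record, "porticoes"))  # → True, via the AAT field
```

Because retrieval unions across the fields, a heading missing from one vocabulary costs nothing as long as another vocabulary supplies it -- which is exactly what makes the trial low-risk.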
All but one of the collections uses local terms at least as occasional supplements to standard vocabularies, and five use them as the primary vocabulary. Rogers’s factor of relative advantage highlights additional aspects of current professional knowledge growth. Slide libraries have gotten along with rigid filing systems, small labels, limited-access lightboxes, and projectors for decades. Where visual information is sought for clearly defined and slow-changing purposes, the ability to browse online has little urgency. Shatford’s question about which subjects should be indexed, and which should not, is still far from being answered. When asked about uses of the collection, the curators described known-item, or known-category, searches, not the exploratory, iterative, refine-the-question-as-you-go searches described in research on textual information seeking behavior (Kuhlthau, 1993). One of the curators reflected on the fact that visual resource collections are based on “specific, deep, repeated courses.” This makes them very different from collections supporting creative work such as advertising, where what is desired may be a concept or a mood; there, an imagebase may have more persuasive advantages. The apparent predictability of information needs in this small data set raises the question of the importance of indexing at the levels of aboutness and interpretation. Only one of the projects actually uses ICONCLASS, although another has a field for its eventual use. None use TGM1. The in-house vocabularies, with the exception of
2-M-art, do not reflect analytic or interpretive approaches. Would availability of a truly comprehensive imagebase with rich subject access to both image and objects within the image change the way faculty work? Perhaps; technologies often have unforeseen impacts. But possible and unforeseen impacts are not the motivators for change. Perhaps the reason students have welcomed digitization to a greater extent than faculty has less to do with their youth than with the fact that the advantage of sitting at a workstation over gathering around a lightbox is much more clear cut. Compatibility, Rogers’s fifth factor, is a problem for many faculty. The faculty members know which slide drawers contain “their” slides, and they appreciate the tactile routines of pulling slides. Slide projectors rarely malfunction, and they serve the need of lecturers quite well. Here, interestingly, changes in subject access are less problematic. As long as slides are physically arranged in the same ways, the database and vocabulary are a sort of superstructure which can be ignored. One librarian (not among the respondents for this study) who was a pioneer in image digitization has created an extremely rich imagebase -- but she told me that her faculty and students never use it. While she has clearly contributed to the knowledge of the profession through her publications, the innovation remains largely unused at the local level. By contrast, the integration of the slide collection into a university OPAC has greatly increased its use by individuals outside the primary departmental clientele. Subject access may be a more revolutionary innovation in visual resource collections than the technological advance of image digitization. Users have always had images available to examine; the change from a slide drawer to a monitor seems one of convenience rather than of intellectual process. 
It is a much greater change to go from providing access by location, time period, and creator to providing access by subject. It allows one to search for the unknown rather than for the known. The research questions for the larger study for which this was a pilot project will focus on subject access in greater detail. This project provided a sense of the landscape -- current practice, the vocabularies, the contexts of formal and informal knowledge growth within which subject decisions are made. The next step is to add depth to the exploration of subject: the nature and structure of home-grown vocabularies, differences between teaching and research uses of image collections, and an investigation of the thinking behind subject decisions. The processes of growth in a profession’s knowledge base are multivariately messy. Innovation often takes place without clear-cut goals and objectives; rather, there may be only a sense that this is something too important to ignore, or an opportunity may present itself to which the curator has to react. Advances are uneven. Each professional must solve the problems of innovation in the context of a specific organization with needs and expectations which have evolved over time. The new knowledge and practices codified in the published literature and standards of the field may or may not be instantiated in its individual members. Local practice may or may not be communicated much beyond the walls of the institution. Every respondent
expressed a feeling, at one time or another, of having insufficient information for what she was trying to do. Most also expressed a sense of working alone and needing to reach beyond her own institution for the necessary information and ideas. The data reported here reveal some of the ways in which advances radiate out along a ragged set of formal and informal channels, and accrete to the body of knowledge that forms the foundation of future growth.
References

Besser, Howard and Trant, Jennifer (1995). Introduction to imaging: Issues in constructing an image database. Los Angeles: J. Paul Getty Museum Publications.

Bowker, Geoffrey and Star, S. Leigh (1997). How things (actor-net) work: Classification, magic, and the ubiquity of standards. Philosophia 25, 195-220 (in Danish). Available in English from: http://alexia.lis.edu/~bowker/actnet.html.

Bowker, Geoffrey C. and Star, Susan Leigh (1999). Sorting things out: Classification and its consequences. Cambridge: MIT Press.

Chang, Shih-Fu, Smith, John R., and Meng, Jianhao (1997). Efficient techniques for feature-based image/video access and manipulation. In Heidorn, P. Bryan and Sandore, Beth (Eds.), Digital image access and retrieval: Papers presented at the 1996 Clinic on Library Applications of Data Processing. Urbana-Champaign, Illinois: Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign, 86-99.

Getty Research Institute (1999a). Getty Thesaurus of Geographic Names. Available at: http://shiva.pub.getty.edu/tgn_browser/.

Getty Research Institute (1999b). Union List of Artists' Names. Available at: http://shiva.pub.getty.edu/ulan_browser/ulan_intro.html.

Goode, William J. (1969). The theoretical limits of professionalization. In Amitai Etzioni (Ed.), The semi-professions and their organization (pp. 266-313). New York: The Free Press.

Greenwood, Ernest (1962). Attributes of a profession. In Sigmund Nosow and William H. Form (Eds.), Man, work, and society (pp. 206-218). New York: Basic Books.

Huang, Tom, Mehrotra, Sharad, and Ramchandran, Kannan (1997). Multimedia Analysis and Retrieval System (MARS) project. In Heidorn, P. Bryan and Sandore, Beth (Eds.), Digital image access and retrieval: Papers presented at the 1996 Clinic on Library Applications of Data Processing. Urbana-Champaign, Illinois: Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign, 100-117.
ICONCLASS Research and Development Group (1999). The ICONCLASS home page. Available at: http://iconclass.let.ruu.nl/.

Jain, Ramesh (1997). Visual information management. Communications of the ACM 40, 31-32.

Kenney, Anne R. and Chapman, Stephen (1996- ). Digital imaging for libraries and archives. Ithaca: Cornell University Library, Department of Preservation and Conservation.

Kuhlthau, Carol C. (1993). Seeking meaning: A process approach to library and information services. Norwood, N.J.: Ablex.

Lanzi, Elisa (Coord.) (1999). Roundtable VI: VISION Project: Issues and outcomes. Panel and discussion presented at the 17th Annual Conference of the Visual Resources Association, Los Angeles, California.

Layne, Sara Shatford (1994). Some issues in the indexing of images. Journal of the American Society for Information Science 45, 583-588.

Layne, Sara Shatford (1994). Artists, art historians, and visual art information. Reference Librarian 47, 23-36.

Library of Congress, Prints and Photographs Division (1995). The Library of Congress Thesaurus for Graphic Materials 1: Subject terms. Available at: http://lcweb.loc.gov/print/tgm1.

Markey, Karen (1988). Access to iconographical research collections. Library Trends 37, 154-174.

Markey, Karen (1986). Subject access to visual resources collections: A model for computer construction of thematic catalogs. New York: Greenwood Press.

Mason, Ellsworth (1971). Great gas bubble prick'd; or, computers revealed -- by a gentleman of quality. College and Research Libraries 32, 183-196.

McRae, Linda and White, Lynda S. (Eds.) (1998). ArtMARC sourcebook: Cataloging art and architecture and their visual images. Chicago: American Library Association.

Miles, Matthew B. and Huberman, A. Michael (1994). Qualitative data analysis, 2nd edition. Thousand Oaks: Sage Publications.

MPEG-2 (1996). Generic coding of moving pictures and associated audio information. Available at: http://drogo.cselt.stet.it/mpeg/standards/mpeg7/mpeg-7.htm.

Museum Educational Site Licensing Project (1996). Homepage. Available at: http://www.ahip.getty.edu/mesl.

O'Connor, Brian C. and O'Connor, Mary Keeney (1999). Categories, photographs, and predicaments: Exploratory research on representing pictures for access. Bulletin of the American Society for Information Science 25, 17-20.

Parsons, Talcott (1939). The professions and social structure. Social Forces 17, 457-467.
Petersen, Toni (Ed.) (1994). Guide to indexing and cataloging with the Art and Architecture Thesaurus, 2nd ed. Los Angeles: J. Paul Getty Museum Publications.

Petersen, Toni and Molholt, Pat (1990). Beyond the book: Extending MARC for subject access. Boston: G.K. Hall.

Pierce, Sydney J. (1987). Characteristics of professional knowledge structures: Some theoretical implications of citation studies. Library and Information Science Research 9, 143-171.

Rasmussen, Edie M. (1997). Indexing multimedia: Images. Annual Review of Information Science and Technology 32. Medford, NJ: Information Today, 169-196.

Rogers, Everett M. (1995). Diffusion of innovations, 4th ed. New York: The Free Press.

Rogers, Everett M. and Kincaid, D. Lawrence (1981). Communication networks: Toward a new paradigm for research. New York: The Free Press.

Schuller, Nancy S. (1989). Guide for management of visual resource collections. Visual Resources Association.

Shatford, Sara (1986). Analyzing the subject of a picture: A theoretical approach. Cataloging and Classification Quarterly 6, 39-62.

Shatford, Sara (1984). Describing a picture: A thousand words are seldom cost effective. Cataloging & Classification Quarterly 4, 13-30.

Small, Jocelyn Penny (1991). Retrieving images verbally: No more key words and other heresies. Library Hi Tech 9, 51-60, 67.

Suchman, Lucy A. (1987). Plans and situated actions: The problem of human-machine communication. Cambridge: Cambridge University Press.

The Vision Project (n.d.). Available at: http://cobweb.cc.oberlin.edu/art/vra/vision.html.

Trant, Jennifer (1997). Exploring new models for administering intellectual property: The Museum Educational Site Licensing Project. In Heidorn, P. Bryan and Sandore, Beth (Eds.), Digital image access and retrieval: Papers presented at the 1996 Clinic on Library Applications of Data Processing. Urbana-Champaign, Illinois: Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign, 29-41.

Visual Resources Association (1997). Data standards. Available at: http://www.oberlin.edu/~art/vra/dsc.html.

Wasserman, Stanley and Faust, Katherine (1994). Social network analysis: Methods and applications. Cambridge: Cambridge University Press.
Weedman, Judith (1992). Informal and formal channels in boundary-spanning communication. Journal of the American Society for Information Science 43, 257-267.

Weedman, Judith (1993). On the "isolation" of humanists: A report of an invisible college. Communication Research 20, 749-776.

Weedman, Judith (1998). The structure of incentive: Design and client roles in application-oriented research. Science, Technology, & Human Values 23, 314-345.

Weedman, Judith (1999). Conversation and community. Journal of the American Society for Information Science 50, 907-928.

Weibel, S. and Miller, E. (1997). Image description on the Internet: A summary of the CNI/OCLC Image Metadata Workshop, September 24-25, 1996, Dublin, Ohio. D-Lib Magazine, January. Available from: http://www.dlib.org/dlib/january97/oclc/01weibel.html.

Wellman, Barry and Berkowitz, S.D. (1988). Social structures: A network approach. Cambridge: Cambridge University Press.

White, Howard D. and McCain, Katherine (1998). Visualizing a discipline: An author co-citation analysis of information science, 1972-1995. Journal of the American Society for Information Science 49, 327-355.
Contributors

Jack Andersen
Royal School of Library and Information Science
Birketinget 6
2300 Copenhagen S
Denmark
Email: [email protected]

Terrence A. Brooks
School of Library and Information Science
University of Washington
Box 352930
Seattle, WA 98195-2930
Email: [email protected]

Frank Sejer Christensen
Royal School of Library and Information Science
Birketinget 6
2300 Copenhagen S
Denmark
Email: [email protected]

Elisabeth Davenport
Napier University Business School
Edinburgh, EH11 4BN
Scotland
Email: [email protected]

Victoria Francu
Central University Library
6, Transilvaniei St.
Bucharest
Romania
E-mail: [email protected]

Elin Jacob
School of Library and Information Science
Indiana University
10th & Jordan
Bloomington, IN 47405
Email: [email protected]
Hope A. Olson
School of Library & Information Studies
University of Alberta
3-20 Rutherford South
Edmonton, AB T6G 2J4
Canada
E-mail: [email protected]

Uta Priss
School of Library and Information Science
Indiana University
10th & Jordan
Bloomington, IN 47405
Email: [email protected]

Miguel E. Ruiz
The University of Iowa
School of Library and Information Science
3087 Main Library
Iowa City, Iowa 52246-1420
E-mail: [email protected]

Padmini Srinivasan
The University of Iowa
School of Library and Information Science
3087 Main Library
Iowa City, Iowa 52246-1420
E-mail: [email protected]

Judith Weedman
San José State University
School of Library and Information Science
Fullerton, CA 92834-4150
E-mail: [email protected]