A New Method for Matching a Document to Potential Users’ Information Needs

Yuri KAGOLOVSKY, Stefan PANTAZI, Jochen R. MOEHR
School of Health Information Science, University of Victoria, Victoria, BC, V8W 3P5, Canada

Abstract. This paper explores how the information needs that a document can address can be captured. This is important for improving the indexing strategies applied to document collections. We propose and have implemented a cognitive science approach: the “jeopardy game” method of evaluation combined with “think aloud” analysis. The results of a demonstration study are presented and discussed, and some possible improvements to the method for matching a document to potential users’ information needs are identified.

1. Introduction

Information needs (INs) in information retrieval (IR) have been intensively studied during the last 40 years [1, 2]. Understanding INs is essential to understanding how users interact with information systems. A variety of methods have been used to investigate how users work with information resources and understand them. These methods are often qualitative in nature and are taken from the social sciences and humanities: assessment of users’ ability to find and apply specific information [3-5], observation and monitoring of user interaction with an information system [6, 7], and “think aloud” protocol analysis [8, 9].

Although information needs have been a focus of information science research for at least 40 years, there is still no reliable method for investigating what potential information needs a document can satisfy. Such a method is important for improving the process of searching and finding existing text documents in a document collection. For example, without a thorough understanding of the requirements of the users of an information source, an indexing strategy for a document collection would be difficult to develop. In this study we are primarily interested in finding out how the INs that a document can address can be captured.

2. Theoretical considerations

A user’s information need can exist in three forms: implicit, explicit (an information need statement), and as a query for submission to a search engine [10, 11]. The first form, the implicit IN, can be thought of as a cognitive structure consisting of concepts connected by conceptual relationships. These concepts and relationships are activated when a user encounters a problem. When the problem cannot be solved without additional information, the user is in an anomalous state of knowledge (ASK). This concept was introduced by Belkin [12, 13] and is considered central to understanding the role of the user in the information retrieval process [14, 15].
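As a purely illustrative sketch (the data structures and names here are ours, not part of Belkin's model), an implicit IN and the ASK it gives rise to might be represented as follows:

```python
# Hypothetical sketch of an implicit information need as a cognitive structure:
# concepts connected by conceptual relationships, where unresolved
# relationships mark the gap in knowledge that constitutes an ASK.
concept_structure = {
    ("angina", "is_a"): "chest pain",   # knowledge the user already has
    ("angina", "caused_by"): None,      # unresolved: the user does not know
    ("angina", "relieved_by"): None,    # unresolved: the user does not know
}

def anomalous_state(structure):
    """Return the unresolved (concept, relationship) pairs --
    a crude stand-in for Belkin's anomalous state of knowledge."""
    return [pair for pair, value in structure.items() if value is None]

# The unresolved pairs are what a search would need to fill in.
print(anomalous_state(concept_structure))
# → [('angina', 'caused_by'), ('angina', 'relieved_by')]
```

In these terms, resolving the ASK corresponds to filling in the missing values, which is what the transformation into an explicit statement and then a query is meant to achieve.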
In order to resolve the ASK, the cognitive structure representing an information need has to be transformed into an explicit information need statement (INS). This transformation into the second form of IN is usually difficult, as is well known to reference librarians and information professionals. The third form of IN is a formal query ready for submission to a search engine. The process of formulating and re-formulating queries is also very problematic for users who are not information specialists, because it requires knowledge of search language syntax and is often specific to a particular search engine.

These different potential representations have to be distinguished when one talks about INs. Moreover, it is desirable to translate the contents of a document adequately and reliably into any one of the above representations, both to characterize the content and to assess to what extent the document is able to match an IN expressed in a similar manner. Commonly, the indexing of documents, for example by library indexers, consists of matching the contents of a document, as perceived by the indexer, to a controlled terminology such as MeSH. These “manual” approaches are complemented by automated ones, such as word-statistical approaches [16-18]. Neither is satisfactory, since both can lead to misclassification of documents and poor performance of document retrieval. Some researchers have long advocated the necessity of “user-centered indexing” [19]. We propose here an approach that may help to implement this kind of indexing, as well as to reduce the shortcomings of the traditional approaches.

The fact that users have difficulties translating implicit information needs into natural language makes the task of eliciting INs difficult. Moreover, the task of the experiment has to be understood by participants.
Thus, simply giving a user a document and asking what information needs it could meet would not work, as users would have difficulties understanding what is meant by “information needs.” We suggest that the approach we propose overcomes these obstacles.

3. The method

In our approach [10, 11], users are asked to model a potential ASK by formulating possible questions that a text provides answers to. We call this a “jeopardy game” approach, as it resembles a popular North American TV show in which the participants are presented with an answer to a hypothetical question and, to solve the puzzle, have to formulate the question that matches the answer presented.

In our version of the “jeopardy game,” a user is presented with a text and asked to read it aloud. This is done in order to assess the user’s knowledge of the terminology and the domain of study, as well as the cognitive processes involved in text comprehension. After the user finishes reading, she is asked to identify potential questions that the text can answer: “What are the questions that this text can answer?” or “If this text has some answers, can you formulate the questions matching these answers?” Alternatively, the participants can formulate their information needs in the following form: “I would be interested in reading this document if I wanted to find information about…”

4. Subjects and experiment

The experiment was conducted with eight people with different levels of knowledge of biology and medicine and their terminology (1 – low, 2 – medium, 3 – high). Users were presented with a short text: a paragraph from the document “Coronary Artery Disease” (http://www.mayoclinic.com/invoke.cfm?id=DS00064) created by the Mayo Clinic. It is also available through a link from the National
Library of Medicine MedlinePlus database (http://www.nlm.nih.gov/medlineplus/coronarydisease.html), which provides health care information to the general public:

“If your coronary arteries can't supply enough blood to meet the oxygen demands of your heart, the result may be chest pain called angina. It's often described as a pressure or tightness in the chest — as if someone were standing on your chest. Angina is usually brought on by physical or emotional stress. Stress increases the amount of blood the heart needs, but narrowed arteries prevent enough blood from getting to the heart muscle. The pain typically goes away within minutes of stopping the stressful activity (as the increased demand for oxygen is reduced). Angina can also be relieved by a medication called nitroglycerine, and controlled with other heart medications as well.”

Every participant was first asked about his or her level of knowledge of biology and medicine and their terminology. Then the purpose of the experiment was explained. The participants were instructed to read the text aloud and to comment (“think aloud”) on any terms or problems with understanding the text that they might encounter while reading. After they finished reading, they were asked to formulate the possible question(s) that the text can provide answers to. There was no limit on either the time or the number of questions that participants could generate. An experiment was considered to be approaching its end when the user could not come up with another information request statement. To confirm this, users were asked whether they could add anything else; if the answer was negative, the experiment was considered finished. Audio recordings of the experiments were then transcribed and coded. For the purposes of this study, the coding focused on identifying the questions generated by the participants.

5. Results

The participants generated 26 different questions, with the average number of questions per person being 7 (min 5, max 9). Participants with a better knowledge of medicine and its terminology did not demonstrably generate a higher number of questions. The participants had no problems with understanding the instructions of the experiment, comprehending the text, or generating questions. This can be attributed to the fact that a similar pattern is used in the popular North American TV show “Jeopardy.” We found that subjects generated similar questions. For example, the majority of the subjects generated the following questions: “What is angina?”, “What causes angina?”, and “What are medications that can relieve angina?” There did not appear to be major differences in the nature of the questions generated by subjects with different levels of biomedical knowledge.

6. Discussion

The experiment demonstrated that the proposed method of identifying potential users’ INs is easy to implement and use. However, while analyzing the results, we noticed that it was difficult to compare the questions. Some of the questions can be considered to have similar meaning although they were formulated differently. This conclusion was arrived at either by comparing questions to their possible answers from the text, or, in the cases when users provided answers to their questions without being explicitly asked to do so, from the recordings. For example, the questions “How does angina feel?”, “What are the symptoms of angina?”, and “How do you describe the pain of angina?” all refer to the same part of the text: “It's often described as a pressure or tightness in the chest — as if someone were standing on your chest”. At the same time, questions formulated the same way sometimes
referred to different answers. Thus, when asking the question “What causes angina?”, some participants referred to “physical and emotional stress” and others to “coronary arteries can't supply enough blood to meet the oxygen demands of your heart.”

These results suggest a modification of the original evaluation strategy. First, users should be instructed to provide an answer from the text to every question they generate. Second, it could also be advantageous to ask users to provide a short summary of the text at the end of the experiment.

The proposed changes are consistent with the current understanding of cognitive processes [20]. It has been demonstrated that knowledge structures in the human mind can be viewed as an associative net consisting of richly interconnected propositions. Although the term “proposition” is often used in logic, cognitive psychologists and cognitive linguists have extended its meaning. A proposition in this sense is a predicate-argument schema and is often seen as a basic unit of semantics, and of language in general [21]. Psychological experiments have also demonstrated that propositions can be considered the semantic processing units of the mind [20]. Direct evidence of this comes from studies of cue and free recall [22], reading times and recall [23], and priming [24]. Propositions thus appear to be the key semantic processing units of the mind and hence the most useful form of representation of text semantics. It has been argued that a network of propositions, as a method of knowledge representation, makes it possible to overcome most of the limitations of fixed knowledge structures while combining and extending their advantages. At the same time, more and more researchers agree that knowledge is represented as a network whose nodes can be propositions, schemas, frames, scripts, and production rules [20].
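Purely as an illustration (our own sketch, not a formalism taken from [20] or [21]), such a network of predicate-argument propositions with weighted links might look like this:

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass(frozen=True)
class Proposition:
    """An atomic proposition: a predicate (relation) with its arguments."""
    predicate: str
    arguments: Tuple[str, ...]

@dataclass
class PropositionNet:
    """An associative net: propositions linked with varying strengths."""
    links: Dict[Proposition, Dict[Proposition, float]] = field(default_factory=dict)

    def connect(self, a: Proposition, b: Proposition, strength: float) -> None:
        # Links are symmetric; the strength models how tightly the two
        # propositions are associated in the net.
        self.links.setdefault(a, {})[b] = strength
        self.links.setdefault(b, {})[a] = strength

# Two propositions one might extract from the angina paragraph above
# (our reading of the text, for illustration only):
cause = Proposition("CAUSE", ("stress", "angina"))
relieve = Proposition("RELIEVE", ("nitroglycerine", "angina"))

net = PropositionNet()
net.connect(cause, relieve, strength=0.5)  # both share the argument "angina"
```

Questions together with their answers from the text would allow such propositions to be reconstructed, which is what the modified evaluation strategy aims at.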
The meaning of a node is given by its position in the net and the strengths with which it is linked to its neighbors. Atomic propositions consist of a predicate (a relation) and one or more arguments. The predicate usually controls the number and kinds of arguments that can be used with it. Predicates are frequently represented by verbs, adverbs, and adjectives, while arguments can be represented by nouns. Complex propositions consist of an atomic proposition, time and place circumstances, and optional modifiers. Complex propositions are combined into micro-propositions representing the meaning of a text. Rules have also been proposed that can transform an elaborated micro-propositional structure into macro-propositions, which represent a summary of the text [20, 25, 26]. Therefore, having both questions and answers would permit the reconstruction of the propositions representing users’ information needs and would overcome problems associated with possible misinterpretation of the questions formulated by the participants. Summaries of the text would provide a good approximation of the macro-propositions.

7. Conclusion

The results of the experiment demonstrated that a combination of the “jeopardy game” principle with “think aloud” protocol analysis may be useful for evaluating the contents of textual documents containing health care information. This combination helps ensure that the terminology used in a document, the level of domain knowledge required to comprehend it, and its logical coherence are appropriate for the intended audience. Our method also makes it possible to study users with different levels of knowledge of a domain and of the domain’s language. This is especially important now, when more and more health care information is published for the general public. One of the main advantages of the proposed method is that it permits researchers to identify the users’ information needs that a document can satisfy without imposing a limitation on the number of possible information needs.
Therefore, the results of the approach can also help in improving the indexing of documents in databases, as well as in structuring documents based on the cognitive processes involved in text comprehension.

References

1. Westbrook L. User Needs: A Synthesis and Analysis of Current Theories for the Practitioner. RQ 1993;32(4):541-549.
2. Wilson T, editor. Information Needs and Uses: Fifty Years of Progress?; 1994.
3. Egan DE, Remde JR, Gomez LM, Landauer TK, Eberhardt J, Lochbaum CC. Formative Design-Evaluation of SuperBook. ACM Transactions on Information Systems 1989;7(1):30-57.
4. Wildemuth BM, de Bliek R, Friedman CP, File DD. Medical Students' Personal Knowledge, Searching Proficiency, and Database Use in Problem Solving. JASIS 1995;46(9):590-607.
5. Hersh W, Pentecost J, Hickam D. A Task-Oriented Approach to Information Retrieval Evaluation. JASIS 1996;47(1):50-56.
6. Shneiderman B. Designing the User Interface: Strategies for Effective Human-Computer Interaction. Reading, MA: Addison-Wesley; 1992.
7. Borgman CL, Hirsh SG, Hiller J. Rethinking Online Monitoring Methods for Information Retrieval Systems: From Search Product to Search Process. JASIS 1996;47(7):568-583.
8. Kushniruk AW, Kaufman DR, Patel VL, Levesque Y, Lottin P. Assessment of a computerized patient record system: A cognitive approach to evaluating medical technology. M.D. Computing 1996;13(5):406-415.
9. Kushniruk AW, Patel VL. Cognitive evaluation of decision making processes and assessment of information technology in medicine. Int. J. Med. Inform. 1998;51:83-90.
10. Kagolovsky Y. Systems Analytic Approach to the Evaluation of Information Retrieval (IR) [MSc thesis]. Victoria: University of Victoria; 2000.
11. Kagolovsky Y, Moehr JR. A new approach to the concept of "relevance" in information retrieval (IR). In: Patel VL, Rogers R, Haux R, editors. MedInfo 2001: 10th World Congress on Health and Medical Informatics; 2001; London, England: IOS Press; 2001. p. 348-52.
12. Belkin NJ, Oddy RN, Brooks HM. ASK for information retrieval: Part I. Background and theory. Part II. Results of a design study. JD 1982;38:61-71, 145-164.
13. Belkin NJ. Cognitive models and information retrieval. Soc Sci Inf Stud 1984;4:111-129.
14. Robertson SE, Hancock-Beaulieu MM. On the Evaluation of IR Systems. IPM 1992;28(4):457-466.
15. Schamber L. Relevance and information behavior. In: ARIST; 1994. p. 3-48.
16. Hersh WR. Information Retrieval: A Health Care Perspective. New York, NY: Springer-Verlag; 1996.
17. Kagolovsky Y, Miller M, Moehr JR. Statistical concept representation for indexing of clinical narratives. In: Fisher P, editor. COACH Conference 22; 1997 April 14-16; Vancouver, BC, Canada: Healthcare Computing & Communications Canada, Inc., Edmonton, Alberta, Canada; 1997. p. 118-126.
18. Meadow CT, Kraft DH. Text Information Retrieval Systems. 2nd ed. San Diego: Academic Press; 2000.
19. Fidel R. User-centered indexing. JASIS 1994;45(8):572-6.
20. Kintsch W. Comprehension: A Paradigm for Cognition. New York: Cambridge University Press; 1998.
21. Perfetti CA, Britt MA. Where Do Propositions Come From? In: Discourse Comprehension: Essays in Honor of Walter Kintsch. Hillsdale, NJ: Lawrence Erlbaum Associates; 1995. p. 11-34.
22. Goetz ET, Anderson RC, Schallert DL. The representation of sentences in memory. JVLBA 1981;20:369-85.
23. Kintsch W, Keenan JM. Reading rate and retention as a function of the number of propositions in the base structure of sentences. Cog Psyc 1973;5:257-79.
24. Ratcliff R, McKoon G. Priming in item recognition: Evidence for the propositional structure of sentences. JVLBA 1978;17:403-18.
25. van Dijk TA. Macrostructures: An Interdisciplinary Study of Global Structures in Discourse, Interaction, and Cognition. Hillsdale, NJ: L. Erlbaum Associates; 1980.
26. van Dijk TA, Kintsch W. Strategies of Discourse Comprehension. New York: Academic Press; 1983.
