Dynamic Query Expansion for Information Retrieval of Imprecise Medical Queries

Submitted by Dennis Wollersheim, BSW La Trobe, BSc Lethbridge

A thesis submitted in total fulfilment of the requirements for the degree of Doctor of Philosophy

Department of Computer Science and Computer Engineering
Faculty of Science and Technology
La Trobe University
Bundoora, Victoria 3086, Australia

May 2005

Table of Contents

Table of Contents
List of Figures
List of Tables
Abstract
Statement of Authorship
Publications Arising From the Thesis
Acknowledgement

Chapter 1 Introduction
  1.1 Targeted Information Retrieval
  1.2 Overview of Imprecise Search
  1.3 Objectives
  1.4 Thesis plan
  1.5 Conclusion
  1.6 Bibliography

Chapter 2 Literature survey
  2.1 Introduction
  2.2 Structure in Medical Information Retrieval
  2.3 General Information Retrieval
  2.4 Query Expansion
  2.5 Conclusion
  2.6 Bibliography

Chapter 3 Problem Formulation
  3.1 Introduction
  3.2 Current Issues in Medical Information Retrieval
  3.3 Problem Background
  3.4 Medical Data Sets
  3.5 Conclusion
  3.6 Bibliography

Chapter 4 General Framework Of The Proposed Methodology
  4.1 Introduction
  4.2 Proposed Framework
  4.3 Benefits of this Work
  4.4 Evaluation
  4.5 Conclusion
  4.6 Bibliography

Chapter 5 Query Expansion Methodology
  5.1 Introduction
  5.2 Expansion Framework Structure
  5.3 Success Measures
  5.4 Corpus and Ontology Descriptive Variables
  5.5 Manipulable Variables Influencing Query Expansion Success
  5.6 Case Study
  5.7 Conclusion
  5.8 Bibliography

Chapter 6 Query Expansion Evaluation Framework Implementation and Results
  6.1 Introduction
  6.2 Materials
  6.3 Method
  6.4 Results
  6.5 Discussion
  6.6 Conclusion
  6.7 Bibliography

Chapter 7 Dynamic Indexing
  7.1 Introduction
  7.2 Method
  7.3 Index Phrase Generation
  7.4 Constructing the Subsumption Hierarchy
  7.5 Conclusion
  7.6 Bibliography

Chapter 8 Dynamic Query Expansion Prototype System
  8.1 Introduction
  8.2 Research Methodology
  8.3 Conceptual Framework
  8.4 System Architecture
  8.5 Evaluation Method
  8.6 DyQE Implementation
  8.7 Case Study
  8.8 Results
  8.9 Discussion
  8.10 Conclusion
  8.11 Bibliography

Chapter 9 Conclusion
  9.1 Query Expansion
  9.2 Dynamic Indexing
  9.3 DyQE
  9.4 Conclusion

Glossary
Appendix 1 UMLS semantic network types
Appendix 2 UMLS Source Vocabularies
Appendix 3 Query Expansion Evaluation Framework Sample Code
Appendix 4 XML Parameter File for Query Expansion Framework

List of Figures

Figure 1-1 Thesis plan.
Figure 2-1 Hypertext HTML version of text guidelines.
Figure 2-2 Hierarchy of query expansion feature sets.
Figure 2-3 Example taxonomy, and subsequently classified text atoms.
Figure 2-4 Remaining taxonomy after a zoom on the concept amoxicillin.
Figure 3-1 Thesis general problem overview.
Figure 4-1 Query expansion evaluation framework.
Figure 4-2 Dynamic index build process from an existing ontological resource.
Figure 5-1 Query expansion framework process.
Figure 5-2 Depth based query expansion.
Figure 5-3 Contrast between depth and probability based expansions.
Figure 5-4 Simple probability query expansion method example.
Figure 5-5 Probability based expansion starting from chronic inflammatory polyneuropathy.
Figure 5-6 Weighted voting calculation.
Figure 5-7 Interconnected source concept method 1.
Figure 5-8 Interconnected source query concept method 2.
Figure 5-9 Interconnected source query concept example.
Figure 5-10 Log multiple expansion for child concepts of the term prostate cancer.
Figure 6-1 Transformation of Ohsumed text to UMLS connected noun phrases.
Figure 6-2 Least retrievable document measure determination.
Figure 6-3 Depth limited expansion precision for selected relationship types.
Figure 6-4 Depth limited expansion count precision using least retrievable document measure.
Figure 6-5 Comparison between UMLS thesaurus relationships and co-occurrence relationships.
Figure 6-6 Comparison of query expansion by relationship source.
Figure 6-7 Average probability intra-method count precision comparison.
Figure 6-8 Average probability intra-method weight precision comparison.
Figure 6-9 Comparison between simple probability and simple voting probability methods.
Figure 6-10 Comparison between simple probability and weighted voting method.
Figure 6-11 Comparison between depth limited search and ISQC methods.
Figure 6-12 Comparison of weight precisions for each standard medical query type.
Figure 6-13 Comparison of expansions generated for single source concept semantic types.
Figure 7-1 Dynamic indexing use after incorporation of arbitrary query expansion.
Figure 7-2 Map of work done in Chapter 7.
Figure 7-3 Automated construction of dynamic indexing components.
Figure 7-4 Matching of UMLS concept to Therapeutic Guidelines text.
Figure 7-5 Cycle of parent links within SNOMED.
Figure 7-6 Algorithm for generating hierarchy without cycles.
Figure 7-7 Sample of extracted hierarchy relating to concept Dyshidrotic Eczema.
Figure 7-8 Extract of a generated hierarchy.
Figure 7-9 High level view of extracted hierarchy.
Figure 8-1 High-level diagram of the DyQE data preparation and usage.
Figure 8-2 Process for generating a query expansion rule set.
Figure 8-3 Process for generating a concept tagged corpus, and a corpus specific hierarchy.
Figure 8-4 DyQE screen layout example.
Figure 8-5 DyQE query processing.
Figure 8-6 DyQE example of a definitional query.
Figure 8-7 DyQE example of a definitional query starting from the concept HIV Infection.
Figure 8-8 DyQE example of a zoom on terms Herpes Simplex and HIV.
Figure 8-9 Expansion concepts generated in response to a diagnosis type query related to AIDS.

List of Tables

Table 1-1 Major categories of medical query type and their imprecise analogues.
Table 2-1 Features of various query expansion strategies.
Table 2-2 Summary of interactive QE features.
Table 3-1 Retrieval formality comparison.
Table 3-2 English language synonyms for the UMLS concept C0037267 (skin).
Table 3-3 UMLS tables used in this thesis.
Table 3-4 MRREL base relationship types.
Table 5-1 Descriptive variables inherent in source concept or query.
Table 5-2 Query expansion methods feature set summary.
Table 5-3 Concepts generated by the subtract average probability method.
Table 5-4 Weight precision of each of the major UMLS relationship types.
Table 5-5 Weighted voting method example.
Table 5-6 Log multiple method example.
Table 5-7 Concepts from a semantic voting expansion method.
Table 6-1 Count and weight precision scores for depth based expansion method.
Table 6-2 Medical query prototypes.
Table 7-1 Advantages and disadvantages of various index generation methods.
Table 7-2 Candidate phrase set generation.
Table 7-3 LVG command line switches used by phrase normalisation phase of the algorithm.
Table 7-4 Result of normalising phrases.
Table 7-5 Precision and recall of various index term generation algorithms.
Table 8-1 Possible user actions in the DyQE system.
Table 8-2 Summary of feasibility of design evaluation results.
Table 8-3 Features of DyQE which support the functional usability criteria.

Abstract

This thesis proposes a system to enhance information retrieval of imprecise queries in the medical domain. We provide a set of techniques for use in the situation where the initial user query does not exactly specify the information needed. Our solution has three parts: 1) user query expansion based on selected ontological relationships; 2) dynamic indexing, focusing on indexing of query expansion results; and 3) interactive browse based summarisation, using the methods developed above.

In part one, we develop a method for generating high quality ontology based query expansions (QE) through an innovative fine-grained query expansion evaluation framework. The query expansion methods range from the simple methods currently in use in the query expansion field to more complex methods taking advantage of ontological semantic information such as concept location, link degree, and type of relationship. The evaluation method of the proposed QE techniques examines the connections between the concepts from a set of queries in a medical document test corpus, and the concepts from the documents judged relevant to the queries. By locating the documents and queries in ontological space, and comparing the document concepts with the set of concepts generated by repeatedly expanding the document queries, we discover those concept characteristics and expansion methods which provide the highest quality expansions. Because of the responsive evaluation mechanism, we are able to judge expansion effectiveness in relation to both the type of expansion method and intrinsic query and ontology specific variables, in both ontology-structural and medical categories. Of particular interest is the development of query expansion rule sets which maximise expansion effectiveness in relation to the medical query category.

Leading on from the QE generation process, we note that, even with the best quality query expansion mechanism, the QE process tends to be characterised by overshoot, that is, the retrieval of a document set that is a superset of the desired document set. To address this issue, part two of the thesis looks at dynamic indexing (DI), a technique for summarisation and interactive management of large retrieval sets. Our contribution in the area of DI is the development of a process whereby DI can be implemented on an arbitrary medical ontology and corpus. Because of DI's specific requirements, this entails two parts: 1) the identification of index concepts within a document corpus; and 2) the building of a hierarchical index based on the identified corpus terms, drawn from an arbitrary ontology. For the former, we choose the best way to identify thesaurus concepts in a given corpus by comparing a variety of concept identification methods. The latter involves the development of an algorithm to generate principled subsumption hierarchies, drawing relationships from an ontology which was not designed for such a purpose.

Dynamic Query Expansion (DyQE) brings together the above work, combining multi-directional query expansion with interactive browsing techniques including dynamic indexing. This provides controlled, integrated access to dynamically generated query expansion information. Our work allows DyQE to be implemented on an arbitrary corpus, with the only necessary ingredients being a domain specific ontology and a set of corpus based relevance judgements. In operation, DyQE is a multiple modality corpus explorer. It offers user text query, dynamic indexing, summary based browsing, and dynamic query expansion based on the current index term and chosen medical query type. This array of exploration types is supplemented by features that support user orientation, such as query preview and document summarisation. DyQE is evaluated by setting out fulfilment criteria, and then building the DyQE prototype and noting how it meets the criteria. The document base preparation, subsumption hierarchy generation, and document concept identification were done using Perl, SQL and a noun phraser, with the results stored in an Oracle 10g database. DyQE itself is implemented in the Java language.

Statement of Authorship

Except where reference is made in the text of the thesis, this thesis contains no material published elsewhere or extracted in whole or in part from a thesis submitted for the award of any other degree or diploma. No other person's work has been used without due acknowledgement in the main text of the thesis. The thesis has not been submitted for the award of any degree or diploma in any other tertiary institution.

The research in this thesis has been generously funded by the Australian government under a DEETYA SPIRT funded scholarship sponsored by the Australian Research Council, National Prescribing Service, and Therapeutic Guidelines Limited.

Dennis Wollersheim Friday, 3 June 2005


Publications Arising From the Thesis

Medical Informatics

Wollersheim, D. (2001a). A Review of Decision Support Formats with Respect to Therapeutic Guidelines Limited Requirements. Ninth National Health Informatics Conference, Canberra, ACT, Australia, Health Informatics Society of Australia: 85-87.

Wollersheim, D. (2001b). Therapeutic Guidelines Decision Support Knowledge Representation. 11th Annual Royal Australian College of General Practice Conference, Melbourne, Australia: 66.

Wollersheim, D. and W. Rahayu (2001). Implementation of Dynamic Taxonomies for Clinical Guideline Retrieval. International Conference on Medical Informatics, Hyderabad, India, Institute of Public Enterprise: 103-111.

Information Retrieval

Wollersheim, D. and W. Rahayu (2002). Methodology For Creating a Sample Subset of Dynamic Taxonomy to Use in Navigating Medical Text Databases. Proceedings of International Database Engineering & Applications Symposium - IDEAS 2002, Edmonton, Alberta, Canada, IEEE Computer Society: 276-284.

Wollersheim, D., W. Rahayu, et al. (2002). Evaluation of Index Term Discovery in Medical Reference Text. Proceedings of International Conference on Information Technology and Applications, Bathurst, NSW, Australia, IEEE Computer Society: 1-6.

Wollersheim, D. and W. Rahayu (2003). An Algorithm For Finding Effective Query Expansions Through Failure Analysis Of Word Statistical Information Retrieval. Proceedings of The Third International Symposium on Communications and Information Technologies - ISCIT 2003, Songkhla, Thailand, Prince of Songkhla University: 367-371.

Wollersheim, D. and W. Rahayu (2005). Using Medical Test Collection Relevance Judgements to Identify Ontological Relationships Useful for Query Expansion. International Workshop on Biomedical Data Engineering - IEEE ICDE 2005, Tokyo, Japan, IEEE Computer Society Press: 67-74.

Wollersheim, D. and W. Rahayu (2005, in press). "Ontology Based Query Expansion Framework for Use in Medical Information Systems." International Journal of Web Information Systems 1(2).

Wollersheim, D. and W. J. Rahayu (2005, accepted for publication). On Building a DyQE - A Medical Information System for Exploring Imprecise Queries. 16th International Conference on Database and Expert Systems Applications - DEXA 2005, Copenhagen, Denmark, Lecture Notes in Computer Science (LNCS), Springer Verlag.

Acknowledgement

We are born in a community; without it, we are lost. While it is only my name on the title page, this thesis could not have been written had it not been for the support I received, not only during the past four years of research, but also during my entire life. I dedicate this thesis to my grandmother, who taught me to love learning, in the way of the best teachers, by loving it herself. Additionally, my parents always said, "we don't care what you do as long as you are happy"; I could not have done this work if I had not figured out how to remain happy doing it.

On a technical side, this work was funded by a scholarship provided by the Australian government in partnership with the Australian Research Council, the National Prescribing Service and Therapeutic Guidelines Ltd, in support of industry-academic collaboration. In addition to the monetary support, it was very useful to have the benefit of the focus provided by such an arrangement.

In the area of academia, my work could not have proceeded without much support. In addition to all my colleagues in the La Trobe University computer science department, I would like to specifically thank Dr. Ken Harvey, Dr. Jonathan Dartnell, Dr. Richard Huggins, Mary Hemming, Assoc. Prof. Teng Liaw, Dr. Bryn Lewis, and Elizabeth Deveny. In the area of tools, I would like to thank Dr. Kris Tolle and Dr. Hsinchun Chen for permission to use the Arizona Noun Phraser.

On the social side, I have received much support from numerous friends, my co-counselling community, and my church. Of special mention is Louisa Flander; without her, this whole endeavour would not have even been an impossible dream.

Finally, I must thank my principal supervisor, Dr. Wenny Rahayu. There are no words to describe the debt that comes from having someone think about me to the extent that she does. I can only hope that I can pass on the favour.

Chapter 1 Introduction

This chapter describes the current state of information retrieval, both generally and within the realm of medical decision support. This leads to an overview of the problem: information retrieval in response to imprecise queries. We point out how the current situation shapes the problem, and describe other methods of working with imprecise retrieval.

1.1 Targeted Information Retrieval

The amount of searchable information in the world is broadening, deepening, and diversifying. Examples of diversification include metadata inclusion and increased target audience specificity. In short, there is continually more data. This increased supply is matched and driven by an increase and broadening of demand; non-experts are encroaching on the experts' turf, lowering the average expertise of a searcher. Additionally, the wider population has started to expect high quality information. These factors drive the enhancement of search (Sevcik 2001).

Traditionally, information retrieval (IR) has focused on targeted search: the retrieval of discrete answers to known questions. The user has a question; the system has an answer; IR is the process of matching them up. This form of IR has been a great boon, as providing correct answers to specific questions is very useful. This is especially true in medicine, where clinicians are confronted daily with questions demanding answers, for example, "What is the dosage for this drug?" or "What is the antidote for this poison?" Moreover, this is a task amenable to computation. Much energy has been expended researching this area, and rightly so.

Targeted search makes the interface between search and user discrete and clean, which simplifies both implementation and evaluation. The user asks, "What is the answer to question A?"; the computer answers, "This is the answer to question A." In this model, search is treated as a discrete standalone tool.

On the other hand, targeted IR is not so useful when dealing with questions and answers that are less precise. The imprecise realm is better served by exploratory approaches to knowledge discovery, with a concomitant increase in user involvement. While imprecise search has elements in common with targeted search, there are also places where the targeted search tools do not serve it well. The basic desiderata of targeted search, correct answers and precise questions, are often missing from imprecise search (Lu and Rahman 2004).


1.1.1 Retrieval Types

Retrieval has been classified into three types: text, data, and knowledge (Lewis and Jones 1996). Text retrieval is characterised by lexical matching on a statistical or probabilistic basis, using unstructured queries against unstructured sources. Data retrieval more directly specifies what will be retrieved, matching precisely in both content and structure. Knowledge retrieval extends the latter, using ontologies to perform deductive reasoning over structured resources.

Even though imprecise search has traditionally been approached via the lower levels of formality, there is room for improvement. Retrieval operations at higher levels of formality are preferable because they are more precise, retrieving fewer irrelevant documents and providing greater control over the process. Barriers to formality include preparation and upkeep costs.
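To make the three levels concrete, the fragment below answers a single information need at each level of formality on a toy example. This is an illustrative sketch only, not the thesis implementation; all data, names, and the one-link is-a "ontology" are hypothetical.

    // Illustrative sketch: one information need ("what treats hypertension?")
    // answered at the three levels of retrieval formality. All data are hypothetical.
    import java.util.*;

    public class RetrievalLevels {
        public static void main(String[] args) {
            // Text retrieval: lexical matching of a query word against unstructured text.
            List<String> docs = List.of(
                    "Beta blockers are widely used for hypertension.",
                    "Aspirin relieves mild pain.");
            for (String d : docs)
                if (d.toLowerCase().contains("hypertension"))
                    System.out.println("text hit: " + d);

            // Data retrieval: exact matching against structured content (drug -> condition).
            Map<String, String> treats = Map.of(
                    "beta blocker", "hypertension",
                    "aspirin", "pain");
            treats.forEach((drug, condition) -> {
                if (condition.equals("hypertension"))
                    System.out.println("data hit: " + drug);
            });

            // Knowledge retrieval: deduce new answers through an ontological is-a link.
            Map<String, String> isA = Map.of("atenolol", "beta blocker");
            isA.forEach((drug, parent) -> {
                if ("hypertension".equals(treats.get(parent)))
                    System.out.println("knowledge hit: " + drug + " (is a " + parent + ")");
            });
        }
    }

The deductive step in the last block is what the higher formality buys: atenolol is retrieved even though no stored fact mentions it treating hypertension directly.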

1.1.2 Medical Information Retrieval

This thesis is aimed at retrieval of medical guidelines, thereby improving medical decision support by increasing accessibility. There exists a large stockpile of such guidelines. Typically, they are:

• stored in narrative text format,
• semantically dense,
• short, and
• discrete.

While the first two factors imply a text retrieval approach, the latter two support data retrieval. This combination of types hints at the possibility of improving retrieval in response to an imprecise query.

Medical decision support is becoming more computerised. Current generation tools include lexical searching and browsing of text, supported by indices and tables of contents. This trend is driven by growth in the amount of content, and a desire for better access. The medical content is rarely described formally, and queries even less often; this argues against a knowledge based approach.

The factors which drive an increase in imprecise query especially impact upon medicine, since this field is very complex, highly volatile, increasing in size, and very interconnected. Also, the democratisation of knowledge brought about by the web has especially impacted upon medicine, due to its very personal nature. The audience is expanding, and consists of fewer professionals. This despecialisation has brought many non-expert searchers to the field, decreasing the average knowledge per searcher and increasing the need to develop tools to deal with imprecise query. The breadth of medical data, and user inexperience, mean the user is often overwhelmed. Because an imprecise query returns either too few or too many results, and in the absence of an ameliorating action, i.e. support for interactivity, both alternatives have the potential to deter the user from continuing to explore the answer space.


Additionally, many medical decisions still require human intelligence. It was a failing of early artificial intelligence to try to eliminate the need for human thinking. We argue that it is more useful to envisage a decision support solution that maximises, rather than replaces, human intelligence.

In medicine, the barriers to the use of a complex interactive system are balanced by high motivation, due to the highly personal nature of the problem; there is a direct reward for diligence. People have high expectations of medicine, and this supports the high level of energy that users are willing to expend to resolve a problem. Because of this, the medical arena is a good place to experiment with interactive solutions, as the user is motivated to persevere in places where the interface is not optimal.

1.2 Overview of Imprecise Search

In the last section, we described targeted search, and pointed out how focusing on this type of search ignores imprecise query. We argued that imprecise query especially impacts upon medicine, and that this situation is ideally addressed by an interactive approach. In this section, we provide an overview of imprecise search, giving examples, and describing ways that it has been dealt with in the past.

An imprecise query is a query which does not precisely specify the information needed, and therefore cannot possibly retrieve the exact documents that the user is seeking. This is due to a mismatch between the language of the query and the relevant documents. While most queries contain both precise and imprecise elements, here we focus on the imprecise facet of the IR process. It is understandable that this area has been neglected, because it is difficult to retrieve something that is not specified. Even given this basic impediment, we argue that there exist resources which increase the usefulness of IR for imprecise query.

Imprecise query (IQ) is not a new phenomenon; in the real world, few queries are entirely precise. IQ is an important topic because the forces that drive it are increasing. Imprecise query originates with users who cannot sufficiently describe what they are looking for, due to either lack of knowledge of the field, or lack of knowledge of self. The domain knowledge deficiency is driven by a number of factors: information is becoming more complex, interconnected, and abundant, and the user base is broadening.

While our work investigates how best to deal with imprecise query in the medical domain, we do not contend that we can take an imprecise query and, from that alone, fulfil the information need. Instead, we consider the imprecise query to be an indication of the user's area of interest, from which the user will continue to explore. Our research is oriented to analysis and evaluation goals, rather than pure ad hoc query.


1.2.1 Imprecise Medical Query Examples

Because we focus on IR in the medical realm, our discussion is illuminated by a look at medical query. This is aided by the work of Berrios et al. (Berrios, Kehler et al. 1998), who categorise medical questions into 13 major categories. These categories were developed to be of use to medical experts; most medical information resources are aimed at this target group. We repurpose these categories for use with imprecise query. The problem with these particular categories is that, as is the nature of categories, the boundaries are crisp, while we want to use them in relation to imprecise queries, where, by definition, the boundaries are not crisp. The saving grace is that most precise query categories have a corresponding imprecise analogue. Examples of the query categories, and their imprecise analogues, can be seen in Table 1-1.

Table 1-1 Major categories of medical query type and their imprecise analogues.

Code | Standard Precise Medical Query Type | Imprecise Query Example
1  | What is the definition of X? | What is X?
2  | What are the risk factors for X? | I am worried about X?
3  | What is the aetiology of X? | Why did X happen?
4  | Can X cause Y? | What happened?
5  | What distinguishes X from Y? | How can I be sure that this is what is happening?
6  | How can X be used in the evaluation of Y? (including diagnosis and follow-up) | How can I ensure success?
7  | How can X be used in the treatment of Y? | What can I do about Y?
8  | How can X be used in the prevention of Y? | How can I stay healthy?
9  | What are the performance characteristics of X in the setting of Y? | How does X work?
10 | How does X compare with Y in the setting of Z? |
11 | Is X contraindicated by Y? | What is happening to me? Is this normal? How bad will it get?
12 | What are the sequelae and prognosis of X? | What will happen to me?
13 | What are the physical properties of X? | What is X? What is X like?

This table shows where an imprecise medical query can fall into a precise category. Current targeted search tries to answer queries largely by looking specifically for the Xs and Ys of the above templates; I call these the substantive words of the query. Given current search techniques, the other query words are mostly too common to make good search terms.
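A minimal sketch of this substantive-word idea follows, under assumed data: the template words act as a stop list, and whatever survives is taken as the search terms. The stop list and example query are hypothetical, chosen only to mirror the templates in Table 1-1.

    // Hypothetical sketch: extract the substantive Xs and Ys from a templated
    // medical query by discarding the common template words.
    import java.util.*;

    public class SubstantiveWords {
        // Common words drawn from the query templates in Table 1-1; too
        // frequent to be useful search terms on their own.
        private static final Set<String> TEMPLATE_WORDS = Set.of(
                "what", "is", "the", "definition", "of", "are", "risk", "factors",
                "for", "how", "can", "be", "used", "in", "treatment", "evaluation");

        static List<String> substantive(String query) {
            List<String> terms = new ArrayList<>();
            for (String w : query.toLowerCase().replaceAll("[?.,]", "").split("\\s+"))
                if (!TEMPLATE_WORDS.contains(w)) terms.add(w);
            return terms;
        }

        public static void main(String[] args) {
            // "How can X be used in the treatment of Y?" with X and Y filled in:
            System.out.println(substantive(
                    "How can metformin be used in the treatment of diabetes?"));
            // prints: [metformin, diabetes]
        }
    }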

This technique works relatively well for precise queries, as the information needed is specific and focused. In the case of the imprecise queries, as the examples show, the information need is not as focused. This means that, when solving the problem of imprecise query, we cannot merely go looking for the substantive words; we must instead approach the problem by looking at the semantic meaning of the entire query.

For example, while the query "I am worried about body part X" is in the risk category, it will not be satisfied by a simple answer. Here, the user has fear driven by lack of knowledge, but the exact knowledge that will allay this fear is as yet unknown, even by the user. Far from being rare, questions of this sort drive much public interest in medical information. This query is not well served by targeted search; it would be difficult to make it precise, and doing so would be unlikely to produce usable results. A general query of this type would return a very broad answer set, inhibiting the search conversation. This type of question demands a new paradigm.

On a slightly more specific note, there are queries of the type "What is diabetes?" This query could be either type 1 or type 13, and while the user might be satisfied by the definition "any of several metabolic disorders marked by excessive urination and persistent thirst" (Wordnet 2004), it is probably a more open ended question, with a multitude of answers. Google returns 13.4 million hits for diabetes, and this magnitude of answer set alone would make it difficult to fulfil the information need. Also noteworthy here is the fact that such a query transcends the crisp category boundaries, and so determining the exact nature of the query requires user involvement.

In the next example, we set out a series of steps taken by a successful exploration which resolves an information need related to an imprecise query. We choose the imprecise query "What can I do about Y?", whose precise counterpart is "How can X be used in the treatment of Y?". The answer to the precise query is straightforward: either "No, it cannot be used", or "Yes, it can be used; this is how". The information desired in response to the imprecise variant is not so clear. In our scenario, the user has a condition Y, and is making a fear-motivated enquiry. In this case, the precise response would be insufficient. The user would be more satisfied by an initial set of answers covering the range of treatments for Y, with this result set strengthened by an investigation of the details surrounding the treatments. This shows a case for a broad result set, supplemented by exploration tools.

1.2.2 The Problem of Imprecise Query

As noted earlier, IQ can be driven by lack of domain and/or self-knowledge. While a domain knowledge deficiency could possibly be addressed by a purely computational solution, such a solution is unlikely to satisfy the self-knowledge deficiency. This is a real problem; if the user does not know what s/he wants, the resulting query is likely to be an imprecise specification of his/her information need. This problem is exacerbated by user inexperience.

Imprecise query has been overlooked in the field of information retrieval. It is more difficult than precise search, because the problem is not as well defined. Also, it requires interaction with the user, something that is difficult both to do and to evaluate.

When faced with an imprecise query, the current generation of search tools does not behave well. The returned result is unusable, being either too small or too large. There is an expectation that users will themselves read through the results, extracting more precise search terms. This puts the onus on the users to reformulate and resubmit their original query. As imprecise query is characteristic of relatively inexperienced and therefore less confident users, this step can lead to an aborted search process, which clearly does not fulfil the users' information need.

Given the above formulation of the imprecise query search process, it is clear that there is an enhanced role for automation here. The user, scanning the retrieved results, is attempting to recognise a set of search terms that will make his/her query more precise. This is a valid goal; recognition memory is much more powerful than recall, and recall has already failed them, as shown by the nature of the original imprecise query. The key point here is that the user is searching for terms related to the original query in some way. A solution would be to use existing relationships to illuminate the area around a topic of interest, making this space available for exploration. If this space is defined by concept and document relationships, we will need both ways to access these connections and ways to deal with them.

Recent trends in information retrieval, such as Google's PageRank, Vivisimo's on-the-fly clustering, and Amazon's recommendation system, all derive their advantage from leveraging relationships. This supports the credibility of a relationship-centred approach. A focus on relationships already exists in the search field, for example in the semantic web, and in query expansion.
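The sketch below illustrates this relationship-centred idea in miniature. The relationship map and documents are hypothetical (the thesis draws such relationships from ontological resources rather than a hand-built table); the point is only the mechanism: terms related to the query topic are surfaced for recognition, folded into the query, and the search rerun, retrieving a superset of the original results.

    // Minimal sketch, under assumed data, of relationship-based expansion:
    // illuminate the area around the query topic, then rerun the lexical search.
    import java.util.*;

    public class RelationshipExpansion {
        static List<String> search(Collection<String> terms, List<String> docs) {
            List<String> hits = new ArrayList<>();
            for (String d : docs)
                for (String t : terms)
                    if (d.toLowerCase().contains(t)) { hits.add(d); break; }
            return hits;
        }

        public static void main(String[] args) {
            List<String> docs = List.of(
                    "Insulin regimens in type 1 diabetes.",
                    "Dietary control of hyperglycaemia.",
                    "Monitoring for diabetic neuropathy.");

            // Hypothetical concept relationships standing in for an ontology.
            Map<String, List<String>> related = Map.of(
                    "diabetes", List.of("insulin", "hyperglycaemia", "neuropathy"));

            Set<String> query = new LinkedHashSet<>(List.of("diabetes"));
            System.out.println("original query hits: " + search(query, docs));

            // Add the related terms and rerun: the hits are a superset.
            query.addAll(related.getOrDefault("diabetes", List.of()));
            System.out.println("expanded query hits: " + search(query, docs));
        }
    }

Here the second and third documents are only reachable through the relationships; the original one-word query could never have retrieved them.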

1.2.3 Approaches to Imprecise Query

The problem of imprecise query has been addressed under the rubric of text retrieval, for example query expansion, but these solutions do not widely incorporate the advantages to be gained from data or knowledge retrieval. Specifically, there is an increasing amount of available semantic structure, of which more formally oriented solutions can opportunistically take advantage.

There are organising forces which address the problem of imprecise query: ontologies, XML tagging, and the structuring and indexing of both documents and queries. There are chaotic forces that exacerbate the problem: an increase in the variety and amount of content, and a decrease in average user knowledge due to the broadening of the user population. This work tries to potentiate the organising forces, while keeping in mind that any answer must account for the chaotic factors, which ensure that there is no fixed solution. The best resolution is one that will be amenable to a continually changing landscape. We note that, while the organising forces have potential application to the imprecise query problem, they are not yet fully developed solutions. Even so, it is of little use to wait until they are fully developed, if only because of the fluid landscape. Instead, this work looks at maximising the value that can be opportunistically extracted from the current iterations of the organising tools.

A. Web Search and Imprecise Query

Worldwide, web search has become very important. The current search tools, e.g. Google, are highly functional, but they offer limited usefulness for imprecise query. The current approach to search is lexically rigorous, in that it only returns results which contain query words. This does not translate into semantic rigor, probably because semantic rigor is more difficult. We lose semantic rigor through individual word ambiguity, and through ambiguous relationships between individual words. Even where there is an implied meaning, such as a particular combination of words in a particular order, current web search cannot determine, let alone search for, that meaning. This is not helped by the paucity and low quality of existing document structure. Even given the constant innovation in the field of web search, current practice gives us little specific assistance with imprecise query (Sherman 2002).

The semantic web promises to revolutionise precise search. Current search uses mostly text retrieval, that is, lexically matching words from a query against the text in a document corpus, and retrieving statistically promising matches. The semantic web proposes adding structure to the documents, allowing the use of a more direct data or knowledge retrieval paradigm. This is enhanced further through the use of ontologies, which transmute the system into a form of knowledge retrieval, using the formal structure to provide logical reasoning capabilities over a structured information store (Berners-Lee 1998; Berners-Lee, Hendler et al. 2001; Ewalt 2005).

The promises of the semantic web have yet to be fully realised. Resources are not yet well identified, and ontologies are insufficiently developed. Even when we can lexically recognise document content, meaning and functionality are ambiguous; we do not know the precise information need that a document serves. There are also questions as to whether these promises are even achievable. Ontologies can have difficulty providing universal usefulness, given that the world is hard to define strictly; the rule "dogs have 4 legs", for example, falls over when encountering a dog with 3 legs. At the very least, there is research to do before we can fully exploit this resource (Shirky 2003; Doctorow 2001).


The transition from the current situation to the semantic web will be gradual. In this thesis, we look for a solution which opportunistically makes use of the existing semantic resources, taking the next step along this gradual path. An operational feature of the semantic web is its ability to use ontological resources to perform logical reasoning about identifiable items. In the current situation, because the items are not identifiable, it is unreasonable to expect logical reasoning. Even so, there are still usable features of this vision. Instead of using ontologies in a formal context, we suggest that there is value in exploring their use in a less rigorous way. This is supported by the fact that ontologies are becoming more prevalent and deepening (for example, see (Lenat 1995)). Also, we are getting better at identifying text resources, improving the potential for ontological connection.

By connecting an ontology to identified text, the text gains value, by expanding its intra-document connections, and by connecting it to a worldview. This has the effect of lengthening the information retrieval reach of a query and increasing the number of connections available to a single text point, broadening access. A problem then arises due to the current lack of rigor; greater reach also means greater irrelevance. We can identify text, and we can connect it to ontologies, but the result, while powerful, is not without error. We argue that the solution must then be a two-step process: first, connecting text to ontologies; and second, providing tools to deal with the resultant query overshoot.

This vision is similar to an IR technique called query expansion (QE), where words related to an original query are added to the query, and the query is rerun. While this attacks the problem of imprecise query more directly than the semantic web solution, it still has drawbacks. When done in an automatic fashion, it does not improve the precision of the original query. Query expansion has been widely explored, but mostly from a simple text retrieval perspective. This ignores the potential, in the areas of usefulness and evaluation, of approaching QE from a data retrieval perspective.

B. The Use of Interactivity when Dealing with Imprecise Query

A final aspect of imprecise query retrieval is interactivity. Most prior thinking around retrieval envisages the user as having a fixed information need. Even when it is specified imprecisely, there is still the idea that the user knows what s/he wants. This view is understandable, because it reduces the complexity of the problem. But it is simplistic, complicated by the dynamic nature of human beings, and contrasts with research which shows that people often arrive at their information need through participation with the retrieval system (Bates 1989). While this does not mean that we should ignore the user's query, it does lend weight to interactivity.


Because of this, imprecise search will not be as straightforward as targeted search. Instead, it will be an exploratory process. Like targeted search, it still begins with a question, but the question is more general. Such a question is not well served by targeted search tools; the lack of detail would likely return either too many or too few relevant results. Because there is insufficient information in the query, we must involve the user in the decision-making process; this means designing in support of interaction, and resisting the pull towards a purely computational solution.

Imprecise search must be integrated into a framework which promotes knowledge exploration and discovery. It must involve the human brain, a computational device better suited to this type of problem, both for refining the search and for recognising the elements of interest in the dataset. Such search is expected to consist of a series of transactions, characterised by non-expert users recognising retrieved elements of interest, rather than experts recalling specific queries.

As in targeted IR, there is a need to get results to the user. But here, raw results are probably insufficient, because they do not do enough to engender the type of conversation we are looking for. The system should of course provide a simple answer when it exists, but when it does not, it should engender hopefulness by giving other options. This means providing tools that deal specifically with the "too much data/too little data" problem. In summary, such a system should be sticky, offering strong information scent and encouraging exploration. Pure retrieval only addresses one part of this problem; a better solution combines retrieval with tools that support an interactive traversal of the problem space.
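As a mechanical sketch of what such a conversation might look like, the loop below shows results plus related terms each round, letting the user refine by recognition rather than recall. All data and the interaction protocol are wholly hypothetical; the DyQE system of Chapter 8 is far richer.

    // Hypothetical sketch of an exploratory search loop: each round shows
    // matching documents plus related terms to recognise and follow.
    import java.util.*;

    public class ExploratoryLoop {
        public static void main(String[] args) {
            List<String> docs = List.of(
                    "Oral hypoglycaemics in type 2 diabetes.",
                    "Insulin regimens for type 1 diabetes.",
                    "Managing diabetic neuropathy.");
            Map<String, List<String>> related = Map.of(
                    "diabetes", List.of("insulin", "neuropathy"),
                    "insulin", List.of("diabetes"));

            Scanner in = new Scanner(System.in);
            String query = "diabetes";  // the imprecise starting point
            while (true) {
                System.out.println("\nResults for '" + query + "':");
                for (String d : docs)
                    if (d.toLowerCase().contains(query)) System.out.println("  " + d);
                System.out.println("Related terms: "
                        + related.getOrDefault(query, List.of()) + " (or 'quit')");
                query = in.nextLine().trim().toLowerCase();
                if (query.isEmpty() || query.equals("quit")) break;
            }
        }
    }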

1.3 Objectives

The objective of this thesis is to look at retrieval through the lens of imprecise query. We search for ways to opportunistically use the emerging semantic resources to improve retrieval. That said, imprecise query covers a broad swath of computing; much of the work of getting information from the computer to the user involves imprecise query. As such, we must limit our scope. While our focus will be retrieval, we draw on classic text retrieval and query expansion. On the other hand, there are many applicable areas which we will not explore, including much of artificial intelligence, for example neural nets and fuzzy logic.

The thesis will investigate the places where conventional medical retrieval fails, and then incorporate query expansion techniques into a data retrieval framework to improve overall medical retrieval. Because we use an ontology as the basis for our query expansion, our result sets have a semantic and structural orientation. However, the breadth of the expansion can produce an overabundance of results. This motivates our work in hierarchical browsing structures, which are used for navigating and filtering the expanded result sets.

Specifically, this thesis addresses the following objectives:

• Discover the best opportunistic use of current ontological resources to enhance IR of imprecise queries.
• Develop and test query expansion algorithms that work in concert with the above ontological resources.
• Find ways to help the user deal with the imprecision that results from the above, also based on the structure found in existing ontological resources.
• Develop a usable system that integrates the above methods.

1.4 Thesis plan

The plan for the thesis is as follows; an overview can be seen in Figure 1-1.

Chapter 2 is a literature survey, covering the range of literature needed to support my thesis. It begins with a look at medical decision support, specifically computer implemented guidelines. This is followed by a review of classic text retrieval techniques, and then a comprehensive look at various forms of query expansion. Finally, we appraise interactive query expansion systems.

Chapter 3 presents the problem definition, describing in detail why it is appropriate to do this work. First, we discuss why this problem is especially acute in medicine, and then examine the deficiencies of standalone text and data retrieval. We formally define the parts of our solution, and then justify the internal and external pieces: query expansion, and dynamic index based browsing. This chapter ends with a list of the medical data sources upon which we conduct our experiments.

Where Chapter 3 is an overview of the problem, Chapter 4 is an overview of the solution. Here, we lay out the map that will be followed through the remainder of the thesis. We start with a section on query expansion, proceed to dynamic interactive summarisation, and finish with a synthesis of the two. Also included are sections on system benefits and system evaluation.

In Chapter 5, we set out the query expansion evaluation framework, a framework within which we can exercise and evaluate various query expansions. In addition to the framework, we detail a range of plug-in query expansion algorithms to use with our framework, and enumerate a variety of ontology and test collection specific variables that could influence the success of query expansion.


[Diagram: (1) Introduction, (2) Literature Survey, (3) Problem Formulation, and (4) Solution Overview lead into two streams: a query expansion stream, comprising (5) Query Expansion Evaluation Framework and (6) Query Expansion Evaluation Results, and an interactive stream, comprising (7) Dynamic Indexing. The streams combine in (8) Dynamic Query Expansion (DyQE) Case Study, followed by (9) Conclusion.]

Figure 1-1 Thesis plan.

After this detailed design, Chapter 6 describes the implementation of the framework. The query expansion plug-ins are exercised, and a detailed statistical analysis of the results is presented. Chapter 7 discusses the details of implementing dynamic indexing over arbitrary medical text. This chapter has two sections: the first compares the success of various algorithms at finding medical concepts in medical guideline text; the second details an algorithm for creating a subsumption hierarchy based on the concepts found in medical text. Chapter 8 presents a synthesis of query expansion and dynamic indexing. Here, we detail an implementation of a system that combines the query expansions found in Chapter 6 with the dynamic indexing structure instantiated in Chapter 7. Chapter 9 is the conclusion, presenting a summary of what has been achieved in the thesis and an indication of the future work remaining in this field.

1.5 Conclusion

This chapter gave an overview of information retrieval, focusing on different levels of retrieval formality. This was followed by an exploration of the difference between targeted and imprecise search, which led to an examination of current approaches to imprecise search, focusing specifically on medical query and the use of interactivity. This general description was followed by a set of thesis objectives, and a plan for their implementation. Where this chapter gave a general summary of the imprecise query field, the next chapter provides a targeted review of existing work in the area. We start with a look at medical retrieval, and then give a bird’s-eye view of the information retrieval field itself. We then comprehensively review query expansion, in both its algorithmic and interactive forms, providing a summary of existing work in the area of imprecise query. This review supports the work covered in later chapters.

1.6 Bibliography

Bates, M. J. (1989). "The design of browsing and berry picking techniques for the online search interface." Online Review 13(5): 407-424.
Battelle, J. The Search: The Inside Story of How Google and Its Rivals Changed Everything. Portfolio Hardcover.
Berners-Lee, T. (1998). Semantic Web Road map. http://www.w3.org/DesignIssues/Semantic.html.
Berners-Lee, T., J. Hendler, et al. (2001). The Semantic Web. Scientific American.
Berrios, D. C., A. Kehler, et al. (1998). Automated Text Markup for Information Retrieval from an Electronic Textbook of Infectious Disease. AMIA 98 Annual Symposium.
Doctorow, C. (2001). Metacrap: Putting the torch to seven straw-men of the meta-utopia. http://www.well.com/~doctorow/metacrap.htm.
Ewalt, D. M. (2005). The Evolution Of Web Search. http://www.forbes.com/technology/2005/04/06/cx_de_0406semantic.html.
Lenat, D. B. (1995). "Cyc: A Large-Scale Investment in Knowledge Infrastructure." Communications of the ACM 38(11): 33-38.
Lewis, D. D. and K. S. Jones (1996). "Natural language processing for information retrieval." Communications of the ACM 39: 92-101.
Lu, Z. and U. Rahman (2004). Intelligent Search Technology for Extracting Scientific Data/Documents on Web. International Conference on Internet Computing 2004, Las Vegas, Nevada, USA, CSREA Press.
Sevcik, P. (2001). Internet Bandwidth: It’s Time For Accountability. Business Communications Review: 10-11.
Sherman, C. (2002). Google's Gaggle of New Goodies. http://searchenginewatch.com/searchday/article.php/2159971.
Shirky, C. (2003). The Semantic Web, Syllogism, and Worldview. http://www.shirky.com/writings/semantic_syllogism.html.
Wordnet (2004). Web Wordnet 2.0. http://sln.fi.edu/biosci/glossary.html.


Chapter 2 Literature survey

2.1 Introduction

In Chapter 1, we described the content of this thesis in broad strokes, noting the general direction we would be heading: traversing a path between imprecise query and relevant documents. This chapter reviews the literature dealing with imprecise queries in medical information retrieval. Because imprecise retrieval is a broad topic, the review covers a wide domain. The thesis covers an arc starting with the problem of imprecise query of medical guidelines, and ending with a framework for evaluating ontology based query expansion and an implementation of best case expansion using a hierarchical browser based paradigm. In this chapter, we point out the dominant features of the landscape between these two points, noting both where we are going and where we are not going. Because imprecise query is a classic problem in computing, there are many solutions; this chapter covers a range of current approaches to the problem, and points out their weaknesses.

We begin by setting out the basis for our solution, which justifies the end point of the literature survey; with that in hand, we cover the supporting elements. The first task is to set out a framework which holds the current set of approaches to the problem of imprecise query. An imprecise query is a query that does not precisely retrieve the set of desired documents. Notwithstanding this, an imprecise query does retrieve some documents, and once it has done so, one or both of the following problem aspects will be true:

• we have not retrieved all the documents that we desire; and/or
• we have retrieved documents that we do not desire.

This situation, where an imprecise query retrieves an imperfect document set, has been approached in many ways. In this review, we present these approaches. In general terms, we break the problem into three components: A) how best to translate the query into relevant documents; B) how to obtain more relevant documents; C) how to eliminate irrelevant documents.

Much imprecise query effort has been focused on solving (A), and because of that, we do not spend much time on it here. (B) has been approached via the tool of query expansion, increasing the number of ways to get from a query to a document. The problem here is that query expansion is a blunt instrument, and the two remaining problems are complementary: query expansion often turns (B) into (C), through the addition of irrelevant documents. So, after query expansion, we still have to filter the results, often by browsing. The problem with browsing is that its utility declines as the number of documents increases. This has been approached by subsequent ordering and active filtering of the aggregate result term set. As such, we choose to explore the state of interactive result set exploration as an answer to (C). In summary, the major focus of this review will be the areas of query expansion and interactive result set exploration.

Before that, it is useful to examine the background work that supports this path, and to point out other potential paths that could solve the same problem. We start with the motivations which drive our exploration of imprecise query. To this end, we look at the general area of medical information provision, specifically the decision support related delivery of medical therapy guidelines. Because a major tension in medical guideline provision relates to the level of information representation structure, we cover guideline provision from a structural point of view. This specific treatment of medical guideline structure is supported by more general coverage of both formality in medical informatics and structure in information retrieval. Based on the structure arguments, and the fact that most medical guidelines are text based, we proceed to examine the area of text retrieval. The queries in traditional text retrieval are often imprecise, and the field has evolved many solutions to the problem. While this thesis does not use much classic text retrieval, these techniques underlie much of query expansion.

After this background information, we review the literature in the focus area of the thesis. We first look at the different varieties of query expansion (QE), ranging from corpus based QE to thesaurus generated QE. This leads to a look at ontology assisted QE, using both general and medical ontologies. Finally, we examine the realm of interactive ontological QE, specifically focusing on the use of dynamic summarisation to visualise and manipulate retrieval results.

2.2 Structure in Medical Information Retrieval

The following section looks at the role of structure in medical information retrieval. We first explore this topic in the area of medical guidelines, and then look at the overall medical informatics area.

2.2.1 Structure in Medical Guideline Provision

We discuss the range of structure involved in different methods of computer based medical therapy guideline provision. Such guidelines range from single-line recommendations to complex multi-part series of steps extending over time. We will not cover computer-aided diagnosis; because of its complexity, it is not a straightforward retrieval issue, and our focus is on imprecise query, not imprecise results.

Retrieval of guideline information began in medicine as in many other fields: retrieval from books. Here, the retrieval was imprecise because the query could only be defined in natural language, which is an imprecise medium. Currently, the balance is shifting from book information to computer-based information; computerisation of medical guideline information is being driven by an increasing amount of, and access to, medical information, and by a push to improve clinical practice. Computer implemented guidelines (CIG) take many forms. In order of increasing structural organisation, CIG types range over archived text, hypertext (HTML) based guidelines, descriptive CIG, and algorithmic CIG. We make these divisions as a way to categorise medical guidelines for the purposes of criticism.

A. Text Based Guidelines
The lowest level of structure functions in a fashion similar to books, with the text merely transposed to a computer screen. A hypertext-based implementation provides more interactivity. The example in Figure 2-1 shows a system that includes a hyperlinked table of contents, an index, and a full text search.

B. Described Guidelines
The next level of CIG organisation is ‘described’ text. Examples of this can be found in the guideline mark-up languages HGML (Hagerty, Pickens et al. 2000) and GEM (Shiffman, Karras et al. 2000). Structure is added via mark-up tags, and the original text is left intact. Arbitrary mark-up tags mean that these formats can represent a wide variety of information, ranging from micro guideline detail to broader summary data. Currently, there are few applications for this type of data, and as such, little standardisation in the languages. A key contribution of descriptive guidelines is that they have enumerated the types of information that make up a guideline. The difference between HTML and descriptive CIG is that the structure in HTML guidelines is used only for formatting and linking purposes; descriptive CIG uses the mark-up structure to describe the semantics of a guideline.

C. Algorithmic Guidelines
Algorithmic CIG is the most structured level of CIG. It is a set of computer instructions that allows machine execution of a guideline. The current generation of algorithmic CIG follows strict execution paths, similar to low-level computer programs.


Figure 2-1 Hypertext HTML version of text guidelines.

Much energy in the CIG community has been focused on the algorithmic realm. Examples include Proforma (Fox, Johns et al. 1998), GLIF (Peleg, Boxwala et al. 2000), Prodigy (Sugden, Purves et al. 1999), and the Arden Syntax (Jenders 2000). This work has solved some important problems, such as:
• the usefulness of separating the guideline knowledge base from the execution engine;
• the use of ontologies to represent medical knowledge;
• tight integration with patient data; and
• some strategies to grapple with problems such as incomplete patient data.

A key feature of algorithmic CIG is that it is executable. This is defined as taking a set of inputs (including patient data and clinician choices) and returning some output (such as a printed prescription, or other recommendation). This procedure can be arbitrarily complex, and so it is especially useful in places where the human mind gets confused, for example in:
• multi-factor decisions;
• decisions that have duration over time; or
• multi-part regimens.

2.2.2 Formality in Medical Informatics

The level of guideline structure described above relates to the level of formality necessary when working with the guidelines: use of highly structured guidelines demands a high level of formality. The choice between different guideline formats is illuminated by some recent work on medical formality. Pereira (Pereira 1996) studied the role of formality in the area of medical language processing. He argues that formal methods are both too strong and too weak: too strong in that they do not allow the necessary flexibility, and too weak because they cannot cover all the possibilities, necessitating continual refinement to cover all the cases. He suggests a return to empiricism, focusing on what works, without the necessity of formal structure. This logic applies to medical decision-making.

Coiera (Coiera 1997) (Coiera 1998) discusses formality in general medical informatics, presenting a set of principles determining the appropriate use of formality. The problem with formalisation is that it is expensive, both during implementation and during use. The cost of use is related to decreased accessibility and the cost of keeping current; the latter arises because once something is formalised, it starts going out of date immediately: formal content is rigid, and the world it describes is constantly changing. This is exacerbated because we do not currently have enough information to know what is a proper tool for any audience of health care provision, and many medical informatics systems were not usable as delivered (Berg 1999). Because of this, Coiera recommends informal treatment for simple and/or new tasks. As an additional point, because an increase in formality means a decrease in access, formality should be minimised, thereby maximising accessibility. On the downside, the problem with informal artefacts is that the data interpretation rules are external to the system; for example, in simple IR, the results must be assessed by the user. This problem can be ameliorated through the use of external tools that support interpretation.

In Chapter 3, we argue extensively that algorithmic CIG is not suitable for our needs. Here, we merely note that only algorithmic CIG needs a precisely defined input specification, which is lacking in many medical decision making situations. The alternative to algorithmic CIG is to treat guidelines in a textual fashion, and apply the tools of text retrieval. General text retrieval algorithms have been used for decades, and have evolved many ways to deal with imprecise query.

2.3 General Information Retrieval

It could be said that all information retrieval is an attempt to solve the problem of imprecise query. For this reason, this section has only limited coverage, and does not discuss all possible solutions to the problem. We begin by continuing the discussion of formality, now in the more general context of information retrieval as a whole. Next, we discuss using both structural and semantic cues for retrieval. We finish this section with an overview of classic text retrieval.

2.3.1 The Role of Structure in Information Retrieval

The sentiments around formality in medical IR are illuminated by the more general work of Lewis et al in the area of general IR structure (Lewis and Jones 1996). The purpose of this section is to show where our overall thesis fits into a model of information retrieval structure. Lewis et al divide IR into three basic retrieval paradigms: text retrieval, data retrieval, and knowledge retrieval. Each has a different structure, and different searching requirements.

Text retrieval is document- or passage-based, around a general topic. This type of retrieval normally uses word-statistical methods to find exemplars of a specified topic. Neither query nor document is required to have structure, which means that the query will be imprecise; because of this, search often has an exploratory nature. An advantage of text retrieval is that it does not require encoding.

Data retrieval, on the other hand, is characterised by direct retrieval. This is the type of retrieval found in databases. The data is modelled as a finite, first order structure, and both queries and data are highly encoded. The advantage of precision is weighed against the high cost of encoding both the document and the query. This solution requires a much higher commitment than text retrieval, and this level of structure needs constant upkeep.

The third type of retrieval is knowledge retrieval. It operates by connecting information to a formal structure, for example an ontology, which adds a layer of indirection to the retrieval process. It too is a direct retrieval, but is more powerful than data retrieval, for example in the areas of question answering and dynamic classification. It can be potentiated by connecting to a powerful formalism such as description logics (Guarino, Masolo et al. 1999). While data acquisition costs can be lowered via techniques such as frame based identification, the necessary level of formalism is costly and difficult to set up. On the positive side, it can use an even more expressive query and data language. The downside of knowledge retrieval, as with data retrieval, is that it is difficult to design the necessary ontologies.

Each of these is a pure model, but combinations do exist; for example, semi-structured text can use a combination of data and text retrieval. These models inform the discussion around which retrieval modality is most appropriate. In our situation, no one model is best suited. Guideline material could be retrieved by any of these, but all would have costs. Text retrieval is the normal approach in the case of imprecise queries, but does not provide the power of the others; also, in our case, the retrieval of brief guideline chunks provides less than optimal depth for pure word-statistical methods. Data retrieval demands a level of structure that is not currently found in the guideline material. Knowledge retrieval is promising, but again, guideline material is not sufficiently structured, nor are ontological resources sufficiently advanced, to permit a pure application of these methodologies. We propose a combination of text and knowledge retrieval, combining text retrieval flexibility with knowledge retrieval extensions, and building a user interface that continues to allow exploration. We will work empirically, opportunistically using structure found in both the guidelines and existing ontologies.

2.3.2 Structural and Semantic Imprecision

Information contains both structure and semantics. By structure, we mean the location of the information; semantics refers to its meaning. Where text retrieval methods take no account of structure, the data and knowledge retrieval areas have some cognizance of it. The power of data retrieval depends largely on the data structure, whereas the power of knowledge retrieval comes from the combination of the data structure and the structure of the ontology connected to it. Text documents contain both structural and semantic resources, and a query can specify either of these, possibly inexactly. Structural imprecision is the situation where we do not have an exact specification of the location of a desired item. Semantic imprecision, on the other hand, is where the query is an inexact specification of the meaning of the desired item.

Structural imprecision certainly exists. The majority of queries have limited or absent structural specification, which is imprecision by absence. In the past, this has been matched by a similar absence in the text resources, which contained little more than linear structure. This situation is changing. As structure increases, for example as guidelines transition from archived text to described CIG, techniques that deal with structural imprecision will become more useful. An overview of structural retrieval can be seen in (Suciu 1998). The retrieval problem can be dealt with using techniques such as proximal nodes and others (Navarro and Baeza-Yates 1997), (Baeza-Yates and Navarro 1996; Baeza-Yates and Navarro 2002), (Abiteboul 1997) (Dongwook, Hyuncheol et al. 1998) (Buneman 1997), or any of the various XML query techniques, reviewed in (Luk, Leong et al. 2002). This thesis will not cover structural retrieval: structurally tagged medical guidelines are not yet prevalent, and there is more need, and ample opportunity, to focus on semantic imprecision.

2.3.3 Classic Text Retrieval

In this section, we break classic text retrieval into three models: Boolean, vector space, and probabilistic retrieval. We first discuss each model alone, and then compare them.


A. Boolean Retrieval
In classic information retrieval, there are three main methodologies. The first, Boolean retrieval, is a simple, clear, and intuitively obvious formalism, based on set theory and Boolean algebra. Here, a query is modelled as a set of index terms separated by the connectives AND, OR, and NOT. This strict logical formula is then used to precisely retrieve documents. Note that this formulation is not amenable to imprecise queries; the query must be exactly specified to get the correct document set. More often, query imprecision retrieves too few or too many documents. This inability to handle imprecision lessens the usefulness of Boolean retrieval. Also, Boolean retrieval does not rank results; each member of the result set is considered fully correct. Ranking can be added to classic Boolean retrieval by counting the extent of the match, but even this does not take into account the relative importance of different query terms (Baeza-Yates, Ribeiro-Neto et al. 1999). In practice, the Boolean model acts in a fashion more similar to data retrieval than text retrieval. As such, it is considered to be the least powerful of the classic models; while it has intuitive appeal, it does not have the flexibility necessary for imprecise query.

B. Vector Space Model
The second classic model is vector space retrieval (Salton and Lesk 1968) (Salton, Wong et al. 1975). It treats both documents and queries as vectors in an N-dimensional space, with N being the number of distinct concepts in the system. The cosine distance between the vectors determines the similarity between queries and documents. Distance in any dimension is determined by weights associated with the terms; weights are often calculated from term frequency and inverse document frequency.

C. Probabilistic Retrieval
Classic probabilistic models are based on the idea that terms appearing in relevant documents are given a higher weight than they would otherwise receive (Crestani, Lalmas et al. 1998) (van Rijsbergen 1979; Harman 1992). The problem with these is that some relevant documents must exist to begin with. Later work by Croft and Harper (Croft and Harper 1979) presented methods for probabilistic retrieval without any prior relevant documents. Probabilistic models are not as broadly used in practice as vector space models; vector space models, as originally conceived, were more functional. Current generation probabilistic and vector space models have comparable performance. Furthermore, probabilistic retrieval’s strong mathematical foundation makes it more elegant, and potentially more powerful, than vector space models.
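To make the vector space model in (B) concrete, the following is a minimal sketch of our own (it is not from the thesis, and the toy documents, query, and function names are purely illustrative): it builds tf-idf weighted vectors for a small tokenised collection and ranks the documents by cosine similarity to a query.

```python
# A minimal vector space retrieval sketch (illustrative only):
# documents and queries become tf-idf weighted sparse vectors, and
# documents are ranked by cosine similarity to the query.
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build a tf-idf vector (term -> weight) for each tokenised document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / df[term]) for term in df}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({term: tf[term] * idf[term] for term in tf})
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["otitis", "media", "amoxicillin"],
        ["amoxicillin", "dosage", "child"],
        ["asthma", "inhaler"]]
doc_vectors, idf = tf_idf_vectors(docs)
q_tf = Counter(["amoxicillin", "child"])
q_vec = {t: q_tf[t] * idf.get(t, 0.0) for t in q_tf}
ranked = sorted(range(len(docs)),
                key=lambda i: cosine(q_vec, doc_vectors[i]), reverse=True)
print(ranked)  # document indices, best match first: [1, 0, 2]
```

Note how the sketch exhibits the vocabulary problem discussed below: a document containing only a synonym of "amoxicillin" would score zero.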


D. Compare and Contrast of Classic Text Retrieval
These classic models have existed since the 1960s, and there has since been much progress, with many extensions and refinements; the criticisms that follow still apply to the extended models. Boolean retrieval, with its demand for exact document terms, is not useful for imprecise retrieval. Both the probabilistic and vector space models are better at this task because they rank retrieved documents according to their semantic closeness to the original query and to the importance of the query terms.

For our situation, the tools offered by these basic text retrieval algorithms are not ideal. The first problem is that users must generate the list of words they are searching for; there is no place for recognition. If the word chosen by the user is not present, the document is not retrieved. This is known as the “vocabulary problem”: the user must know the exact words in the documents being searched, and the words must be specific and of sufficiently high weight. In imprecise query, this is not the case; by definition, the user does not know these words.

The second problem is related to the first. When the user searches for words that are too general, or where the document base is too big, too many documents are retrieved, and the user faces information overload. In some queries, the best words describing a relevant document set are themselves too general, and impossible to refine. The user ends up dealing with huge lists of documents. Even if the documents are ranked, the user will, in practice, view only the top-ranking documents; some relevant results will fall off the bottom of the list.

Finally, the last criticism is based on opportunism. The success rate of these techniques has plateaued, and there is little to gain from further investigation; further success will come from entirely different methods.

2.4 Query Expansion

Query expansion (QE) has been sought as a solution to the problem of imprecise query since the 1960s; it is an intuitively appealing idea with a long and varied history, though its first formulations were not very successful. The popularity of QE stems from its usefulness in solving the ‘vocabulary problem’, where a user query does not consist of the same words as are contained in the relevant document set. This potential usefulness has been shown in various studies; one from the medical field follows. Hersh et al (Hersh, Hickam et al. 1994) (Hersh, Crabtree et al. 2002) studied the relationship between a query and a set of relevant documents, looking at both the set of documents retrieved by a vector space IR system and the relevant set not retrieved. They found support for the use of query expansion by examining the relevant documents not retrieved due to query term absence: in these documents, terms in hierarchical or synonymous relationships to the query terms were often present instead. Query expansion could improve retrieval by including these terms in the query.

In this review, we look at QE from two points of view. First, we examine the algorithmic basis of pure QE. This type of QE tends to run without any user intervention, and is presented to show the breadth of the QE field. Next, we review interactive solutions to QE: systems where QE is integrated into a user interface, allowing human intelligence to supplement the algorithmic. A breakdown of QE frameworks and techniques is shown in Figure 2-2, while their features are outlined in Table 2-1.

[Diagram: Query Expansion (QE) divides into algorithmic QE frameworks and interactive QE techniques. The algorithmic QE frameworks comprise corpus based QE (relevance feedback, automatic local analysis, automatic global analysis) and relationship based QE (general thesaurus based QE, medical QE using UMLS, ontology based IR). The interactive QE techniques comprise control enhancements (multiple access modality, query refinement, category based filtering, simple subsumption QE, multidirection QE, query categorisation) and feedback techniques (result set summarisation, location in hierarchy, query preview).]

Figure 2-2 Hierarchy of query expansion feature sets

Table 2-1 Features of various query expansion strategies

| Algorithm Class         | Algorithm                  | Positives                                            | Drawbacks                                                                                        |
|-------------------------|----------------------------|------------------------------------------------------|--------------------------------------------------------------------------------------------------|
| Corpus Based Algorithms |                            | Enhance IR by maximising retrieval effectiveness     | Relationships are arbitrary and not necessarily comprehensible, and there is no use of semantics |
|                         | Relevance Feedback         | Uses feedback                                        | Needs initial good document                                                                      |
|                         | Local Analysis             | Can work as a filter on thesaurus relationships      | Needs initial good document                                                                      |
|                         | Global Analysis            | Uses entire corpus                                   | Must pre-traverse corpus                                                                         |
| Relationship Based QE   |                            | Enhance IR using recognisable semantic relationships | Not tuned to the corpus                                                                          |
|                         | General Thesaurus Based QE | Recognisable relationships                           | Promiscuous; no filtering                                                                        |
|                         | Ontology Based IR          | Fast, correct                                        | Strict; terms must match, relationships are strictly logical                                     |
|                         | Medical QE using UMLS      | Uses existing corpus, comprehensible                 | Some UMLS relationships are not useful for QE                                                    |
| Interactive QE          |                            | Facilitate exploration by empowering user            | Demands user attention                                                                           |
|                         | Dynamic Taxonomy           | Uses taxonomy, query preview                         | Minimal QE, no query facility                                                                    |
|                         | Cat-a-Cone                 | Multiple access modality                             | No query categorisation or multidirectional QE                                                   |
|                         | Flamenco                   | Highly hyperlinked                                   | No query categorisation or multidirectional QE                                                   |
|                         | Dynacat                    | Medical domain                                       | Fixed queries                                                                                    |

2.4.1 Algorithmic QE

For QE to function, it needs a source of relationships relating query terms to the expanded terms. Algorithmic QE is QE that works with little or no input from the user, apart from the original query. We divide our review of algorithmic QE into areas based on the relationship source. The first area uses unstructured relationships derived from analysis of a document corpus, while the second uses hard-coded relationships from human sources, such as a thesaurus or an ontology.

A. Corpus based QE
Corpus QE has three general streams: relevance feedback, local analysis, and global analysis. Relevance feedback uses user interaction to determine a few relevant documents, and then retrieves more like these. Local analysis does a similar thing automatically, looking for relationships among the documents initially retrieved. Global analysis obtains inter-term relationships from the entire corpus, and expands the query from these.

i. Relevance Feedback

The first corpus-analysis-derived algorithm is called relevance feedback (Salton and Buckley 1990; Harman 1992). Here, documents are retrieved according to one of the standard IR methods; the user then chooses relevant documents (from among the top 10 or 20 documents retrieved), whose terms are used to bias the initial query. The reformulated query is rerun, and the process continues. The bias can expand the query by adding terms, and/or reweight the existing query terms. The methodology used to modify the query to accord with the identified relevant documents can be based on either the vector or the probabilistic model.

In its favour, this algorithm provides a recognition phase that involves the user: the user identifies documents that would contain useful keywords, and the computer derives the keywords from the documents. Relevance feedback is good for retrieval from within large collections: the ‘needle in a haystack’ search.

Relevance feedback is probably less useful for guideline retrieval. Due to the focused nature of the passages, there may be no documents to fulfil the role of ‘initial relevant set’, entailing situations where all initial documents are equally poor. Additionally, nearby semantic jumps are not catered for; to make such a jump under relevance feedback, the user would have to reformulate the query, because choosing single interesting documents at the feedback stage would not greatly bias the original query direction.
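As a concrete illustration of the query modification step (our own addition; the thesis does not commit to a particular formula), the classic Rocchio formulation under the vector model reformulates the query as

\[
\vec{q}_{new} \;=\; \alpha\,\vec{q}_{orig} \;+\; \frac{\beta}{|D_r|}\sum_{\vec{d}\in D_r}\vec{d} \;-\; \frac{\gamma}{|D_{nr}|}\sum_{\vec{d}\in D_{nr}}\vec{d}
\]

where \(D_r\) and \(D_{nr}\) are the vectors of the documents the user judged relevant and non-relevant, and the constants \(\alpha\), \(\beta\), and \(\gamma\) control how strongly the original query, the relevant documents, and the non-relevant documents respectively influence the reformulated query. Setting \(\gamma = 0\) gives pure expansion; raising \(\gamma\) adds the pruning behaviour.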


ii. Automatic Local Analysis
Similar to relevance feedback, local analysis expands a query based on the context established by the set of documents initially retrieved as a result of the query (Attar and Fraenkel 1977). This is done by retrieving the top-ranking documents for the current query, and then modifying and rerunning the query based on the retrieved content. The technique can also be used to refine expansions generated by pure thesaurus or co-occurrence based relationships, by looking at the local context of the current query. Xu and Croft (Xu and Croft 1996) refined local analysis techniques by incorporating ideas from global analysis, including retrieving and searching passages rather than documents, and comparing the similarity of candidate expansion terms to the entire query rather than to individual query terms.

The criticisms of local expansion are similar to those of relevance feedback. It would not perform well on a diverse collection consisting of small text chunks. Even worse, it provides no interaction, assuming that the original query defines the information need.

iii. Automatic Global Analysis
A third set of query modification techniques is automatic global analysis, so called because these techniques look at the entire corpus to obtain the relationships used to modify the query. We discuss two variants. The first is based on a similarity thesaurus (Qiu and Frei 1993). This thesaurus is not derived from a co-occurrence matrix as in the early QE work; instead, while still arising from the document corpus, it uses a distance metric based on the terms being concepts in a concept space. Innovatively, the concept space is indexed by the documents in which the terms appear. Additionally, this algorithm expands the query by choosing terms close (in concept space) to the centroid of the entire query, rather than terms close to the individual query terms. The second form of global analysis joins terms by first clustering the documents into tight clusters, and then choosing the low frequency, high discrimination terms from within these clusters (Crouch and Yang 1992).

While these algorithms are innovative sources of ideas, there are problems with both global and local analysis techniques. Their strength is their independence; they are designed to operate in a black box fashion, without explaining their actions. This is laudable in one respect: it allows us to measure and objectively maximise the effectiveness of the pure QE solution. The problem is that these algorithms assume there is no connection between the search action and the user reaction. They give the best quality results without the ability to clearly explain how they arrived at them. This disconnection from the search process makes them seem arrogant; even if they do improve results, they do it in a way that could alienate the user by offering unfathomable results. This is due to a lack of relationship metadata: even though the algorithm has identified relationships, we do not know anything about them.

The second problem is related to the first, and is common to any corpus-based technique. Our scenario assumes that the user does not know what s/he is searching for, and that the process of retrieval will be iterative and interactive. Corpus based QE has the potential to generate unfamiliar clusters of terms, connected only by association relationships. These are possibly unfathomable to the user, and could lead to confusion and distraction, interfering with the search process.

Thirdly, corpus based QE demands a full traversal of the corpus, necessarily a computationally intense task. It also attracts the criticism aimed at all knowledge structures: they ossify a set of knowledge, and go out of date quickly.

B. Relationship Based QE
The relationships in corpus based QE are derived from the collocation of terms within documents. Alternatively, we can perform query expansion using relationships derived from outside sources. In this section, we first contrast IR based on thesaurus and ontology sources. This is followed by a review of UMLS based medical QE; finally, this information is put into context with an overview of relationship usage in IR.

i. Thesaurus QE

The early experiments in thesaurus based QE had mixed results. The first experiments found synonym expansion to be always useful, whereas hierarchical expansion was useful in select instances only (Salton and Lesk 1968). Later work by Voorhees (Voorhees 1994) on Wordnet based lexical QE found that queries that were already effective were difficult to improve via query expansion. For the other queries, hand-chosen expansion term candidates worked well, but it was hard to automatically pick good expansion candidates. An alternative method used QE to provide an intermediate structure between query and results, using linguistic methods to generate intermediate results; this was found to be useful for subject areas with much ambiguity, for example commonplace web search (Grefenstette 1997).

Bodner and Song (Bodner and Song 1996) developed a system that combined different types of relationships, expanding first with corpus, then thesaurus relationships. This system innovated through its ability to constrain expansion by setting a maximum fan-out threshold and/or a maximum expansion distance. Another noteworthy aspect of the study was a comparison between standard single-term expansion and an algorithm called correlated search, which only expands terms if the terms are connected in the ontology. Correlated search was found to provide better results than isolated search.

A system called Deja-vu (Gordon and Domeshek 1998) used hand-crafted sets of expansions called Expectation Packages: commonsense structures grouping like terms. In practice, all terms in an Expectation Package are connected; if you pick one, you get them all. The problems with these are that they are built with a subjective bias, are hard to maintain, and are historically fixed in time, possibly out of sync with their collections.

One of the latest systems (Järvelin 2001) performs QE on three levels of the data model: conceptual, linguistic, and string. This model provides fine control of conceptual expansion especially, allowing a choice of which links to follow, and to what depth. This allows a high level of refinement, and provides a good structure for further detailed experimentation, allowing much variability. The system was experimental, based on a hand-crafted thesaurus and a small document base.

ii. Ontology Based Information Retrieval
The line between thesaurus QE and ontology IR is not clear, with both terms appearing in the literature. Thesaurus based QE appears more often with text retrieval, improving the recall of a query by the addition of related words. Ontology IR operates at the knowledge retrieval level, connecting semantic concepts. Ontologies are also used to provide concordances for schema expansion, solving the problem of structural imprecision in the realm of data retrieval: for example, if an ontology says A is-a B, and one is looking for types of A, try looking in B’s location. Ontology based IR uses these same sorts of structure to improve the recall of knowledge retrieval, working at a semantic rather than structural level. As such, the constraints on ontology based IR are more rigid, and the results more ‘correct’, than those of thesaurus based QE.

An example of this is the GETSS system (Staab, Braun et al. 1999), a production system that uses an ontology to extend the document base. Strictness is provided by the use of direct, logical QE, following is-a relationships only. Similarly, Guarino developed a system for retrieval from yellow pages and product catalogues (Guarino, Masolo et al. 1999), where it was found that both recall and precision improved through the use of ontological structure. In this case, query expansion was not the goal of ontological use; rather, the ontology was used for filtering results. This shows the promise of using ontological resources. In the medical domain, such an ontology has not yet been developed, and it would be onerous to construct. Other examples of ontology assisted IR include (Decker, Erdmann et al. 1999), (Visser and Schuster 2002), and (Aitken and Reid 2000).
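A minimal sketch of the strict, is-a-only expansion described above follows (our own illustration, not any of the cited systems' code; the toy hierarchy and names are assumptions, and the depth limit is in the spirit of Bodner and Song's constrained expansion):

```python
# Hedged sketch of strict subsumption-based query expansion: a query
# term is expanded with the concepts it subsumes in an is-a hierarchy,
# down to a maximum depth. Toy data; purely illustrative.
CHILDREN = {  # parent -> narrower concepts (the is-a relation, inverted)
    "antibiotic": ["penicillin", "macrolide"],
    "penicillin": ["amoxicillin", "ampicillin"],
    "macrolide": ["erythromycin"],
}

def expand(term, max_depth=2):
    """Return the term plus all concepts it subsumes, to max_depth."""
    expanded, frontier = {term}, [(term, 0)]
    while frontier:
        concept, depth = frontier.pop()
        if depth == max_depth:
            continue  # the expansion-distance constraint
        for narrower in CHILDREN.get(concept, []):
            if narrower not in expanded:
                expanded.add(narrower)
                frontier.append((narrower, depth + 1))
    return expanded

print(expand("antibiotic", max_depth=1))
# {'antibiotic', 'penicillin', 'macrolide'}
print(expand("antibiotic"))
# adds 'amoxicillin', 'ampicillin', 'erythromycin' at depth 2
```

The depth parameter makes the recall/precision trade-off explicit: each extra level widens the retrieval net, at the cost of the result set inflation discussed below.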


Taveter (Taveter 1998) notes a problem with ontology assisted IR: the classification of concepts under taxonomies differs depending on the viewpoint of the taxonomy. For example, the concept gene may have different siblings depending on whether it is viewed chemically, functionally, or evolutionarily.

The MELISA system (Abasolo and Gomez 2000) is an example of ontology based medical IR. Apart from its medical application, MELISA’s contribution is that it separates the ontology, query, and aggregation operators, aggregating and ranking the results from each of the different query permutations.

iii. Medical QE Using UMLS
The area of automatic medical query expansion has been especially fruitful. This is due to the existence of rich medical thesaurus resources, an abundance of medical text, and high user demand. The premier thesaurus resource in the medical area is UMLS, and there has been much interest in using it to facilitate query expansion, with mixed results. Early UMLS based QE showed that such a system, combined with retrieval feedback, returned results significantly above baseline statistical retrieval (Aronson and Rindflesch 1997) (Srinivasan 1996), showing the promise of the rich resource provided by UMLS. Later work by Srinivasan (Srinivasan 1999; Srinivasan, Ruiz et al. 2001) uses a combined rough and fuzzy sets framework to explore UMLS based QE, assigning minimum and maximum probability-based limits on the number of expansions chosen from each of the link types, providing more control than that obtained by adding all related query terms. This level of control is interesting, but the work was not evaluated.

Hersh et al (Hersh, Price et al. 2000) found that unconstrained UMLS based synonym and hierarchical query expansion on the Ohsumed collection degraded aggregate retrieval performance, but some specific instances of QE improved individual query performance. They noted that there was a role for investigation of the specific cases where QE is successful. A recent study by Houston (Houston, Chen et al. 2000) compared three different thesauri for medical QE, consisting of corpus terms, MESH terms, and UMLS terms. They found no significant difference between them, but also little overlap, supporting a case for further exploration to discover the significant features of each term set and combine their strengths. Conversely, Leroy et al find that UMLS relationships provide better QE than Wordnet’s in the medical field (Leroy and Chen 2001). Earlier QE work by the same group (Leroy, Tolle et al. 2000) focuses on a technique called Deep Semantic Parsing. This technique combines co-occurrence and UMLS semantic relationship information, and derives from the idea that a given semantic relationship does not apply to every combination of members of two groups connected by that relationship. The technique uses the found semantic network relationships to limit the co-occurrence relationships used for expansion. The system was evaluated over two types of query, sourced from medical librarians and cancer specialists. Cancer specialists had smaller queries; they were more precise, and had lower recall. Expansion improved recall for the more comprehensive queries, while precision dropped more for the smaller, more precise queries.

iv. Relationships in the Service of IR
In this section we look at the overall issue of relationships in IR, discussing the types of relationships that exist between queries and relevant documents. One way to think about this is through the lens of topicality: documents relevant to a query are likely to be on the same topic as the query. Notwithstanding this, there are other factors besides topicality that determine relevance; for example, an arbitrary hyperlink is some indication of relevance, but not always of topicality. Overall, there is a wide variety of possible relationships. Basic categories include synonymy, hierarchy, and associative relationships, and within these gross categories there are many different types. Green et al (Green 2001) point out that some categories probably have an enumerable, finite number of types (hierarchy, synonymy), but the set of associative relationships is open, limited only by the ability of the human mind to make connections.

Bean and Green (Bean 2001) did interesting work on the use of relationships in IR. They first point out that relationships have been used in IR both to broaden the retrieval net, increasing the number of documents retrieved by a query, and for subsequent filtering; this shows the versatility of relationships in IR. Secondly, they note that there is an almost infinite variety of relationships that can connect a query to relevant documents. The only existence criterion for such a relationship is that it must exist in the mind of a user. This is a large realm, and it complicates the problem of query expansion: relevant relationships are only functionally distinguishable from non-relevant ones, being context-dependent and difficult to classify. Their study consisted of an in-depth analysis of a topical index and a set of matching documents. They found that only one third of relationships were direct topic matches, 10% were hierarchical topic matches, and, surprisingly, more than 50% were ‘structural relationships’. The latter consisted of relationships to: 1) entire complex structures or gestalts; 2) intrastructural components; or 3) specialised complex structures such as metaphor. This shows that there is much scope for more than mere topic matching.


When looking at relationship based IR, it is important to have knowledge of the different types of relationships. As noted above, relationships are complex and ill defined. For example, Artale (Artale 1996) notes that there is complexity even in a relationship type we take for granted: part-whole relationships. Arbitrary part-whole relationships are not transitive; e.g. an arm is part of a musician, a musician is part of an orchestra, but we would not say that an arm is part of an orchestra. To overcome this problem, the authors break part-whole relationships into six distinct types, examples being feature/activity, member/collection, and stuff/object.

Within the UMLS, there are three basic types of relationship: synonymy, hierarchical, and mappings (other) (Bodenreider and Bean 2001). Nelson (Nelson 2001) notes that, while hierarchies are traditionally thought of as narrower and broader relationships, in MESH, a leading subcomponent of UMLS, the hierarchies are more readily described as narrower and broader retrieval sets. Broader hierarchical relationships are generally one of is_a, part_of, conceptual_part_of, or process_of, but there are exceptions; for example, accident prevention is narrower than accident.

v. Criticism of Relationship Based QE
Relationship based QE, like corpus based QE, has been found to be useful, and the area has been widely explored. The UMLS, too, has been found to be a useful source of QE relationships. Even given this broad examination, there are places where more work is needed. Firstly, while these systems explore query expansion using structured relationships, they do not often implement the filtering action which is an equally fruitful use of such relationships. The second point concerns the types of relationships used by QE. The possible set of useful QE relationships is large and growing, and no simple categorisation of relationships will yield perfect query expansion, because the human mind is more complex than that. The problem is that there are possibly more complex categorisations than are currently being used, and this complexity is unexplored. Notwithstanding Green et al’s conjecture that it is difficult to predict topicality in terms of other relationships, there has been little systematic study of this question. Specifically, the above systems test this hypothesis poorly, if at all. Instead, they use a conservative set of expansion relationships, and in the case of ontology-based systems very much so, accepting only strict (i.e. subsumption based) relationships. A broad survey of the relative effectiveness of different types of relationships, in relation to other thesaurus and query variables, has not been done. This concern is a dynamic one: as relationship sources and relevance indicators continue to grow and become more complex, they will need continued study.


C. Criticism of Algorithmic QE
As noted above, algorithmic QE is QE which functions, from the user’s point of view, as a black box, with the input being the initial query and the output, expanded query terms. In the study of algorithmic QE, the investigation into the sorts of relationships that hold between query and document has been extensive. Even so, the complexity still demands more investigation. A recent study of a leading algorithmic QE method across two test collections found false the assumption that the same parameter settings can be used for all queries, concluding that further research is necessary to determine the features that would drive per-query individualisation of parameters (Billerbeck and Zobel 2004).

On the positive side, algorithmic QE has had much success; in a large number of cases, it is able to improve retrieval success. The major problem with algorithmic QE was alluded to at the start of the chapter: QE in and of itself is a blunt instrument. Even with a better understanding of what makes a good QE relationship, there are still too many possible ways to connect a query to relevant documents. Query expansion alone will not fulfil all the user’s desires. At best, it will increase the number of relevant documents at the expense of precision, and/or open up new areas of exploration. It is in regard to this latter point that pure algorithmic QE methods are lacking. They are successful at opening up new areas of document exploration, but then the user is left with just another list of documents; the power of a QE based tool could easily overwhelm the user. The lack of interactivity often extends into the query expansion details themselves: very few of the above systems offer user control over the QE process, such as the expansion depth or direction. The power of QE can be better harnessed by incorporating it into other modalities of searching, especially those which can filter excessively large result sets. It is especially appropriate, and symmetric, to use filters that incorporate the selfsame relationships.

2.4.2 Interactive QE

Interactive QE opens up a large area for exploration. While state of the art automatic QE maximises the additional retrieval success that can be wrought from a query, the addition of interactivity brings a large new dimension into play. The interaction of human beings with information retrieval systems, while widely explored, is still a field in its infancy. Where computers are very complex machines, humans are even more so, and the interaction of these two entities has a combinatory effect on the potential complexity of the situation. We argue that tools which summarise, and allow winnowing of, query expansion results maximise the benefit of query expansion, by more fully utilising the human intelligence component of the system.

The interactivity of IR is further enhanced through the integration of interactive QE, and there has been much work in the area. We divide the interactivity promoting enhancements into two areas, namely choice and feedback. Choice tools allow the user greater choice, whereas feedback tools provide a broader range of information to the user.

Choice tools include:
• multiple access modality
• category based filtering
• simple subsumption QE
• multidirection QE
• query refinement
• query categorisation

Feedback techniques include:
• result set summarisation
• location in hierarchy
• query preview

The problem with categorising interactive QE systems by the above characteristics is that, in general, interactive QE systems are not designed to illustrate a single technique; instead, they aim to improve the entire retrieval experience. Because of this competing dynamic, we will first explain the above categories, noting example systems. We then examine as a whole those systems of significance, such as Dynamic Taxonomy, Flamenco, Cat-a-cone, and Dynacat. Finally, we evaluate, noting the support for, and drawbacks of, the interactive QE techniques.

A. Applicable Features of Interactive QE
Because interactive QE is designed to deal with the complexity of the human situation, there has been a wide variety of systems. In this section, we extract the features of interactive QE systems which aid in resolving imprecise query.

i. Multiple access modality

Multiple access modality is the situation where the user gains access to the document collection through a variety of integrated methods. This extends the browse aspect of all query based searching, where the user scans the retrieved documents. Generally, this takes the form of a synergistic combination of query and hierarchy, a combination incorporated into almost all of the reviewed systems. Here, we present two exemplars. First is Joho’s mixed query/browse system (Joho, Sanderson et al. 2004). There, the hierarchy was often accessed after an examination of the first page of search results, with several benefits: it reduced the number of iterations and paging actions, increased (over baseline) the chance of finding relevant items, and in general helped users to handle the large number of candidate expansion concepts.


Another ‘query first, then browse found relationships’ implementation is based on a Kohonen self-organising graphical concept map (Chen, Houston et al. 1998). It was found that the use of the thesaurus enhanced recall, but not at the expense of precision. A positive feature of this type of system is that the thesaurus terms offered would always exist in the pages, providing strong information scent and encouraging users to continue searching. The thesaurus was found to be most useful for refining broad searches. On the downside, the map concept interfered with subjects which required a more directed search.

These systems limit the interaction between query and hierarchy to browsing from the results of a query. They do not allow further restriction using the hierarchy; nor is it possible to search within the results using another query.

ii. Category based filtering
Category based filtering is another common feature of these systems. In this technique, the current result set can be selected and filtered using metadata derived categories. Dynamic Taxonomy is a system built around this type of filtering, probably because it has no query component.

iii. Simple Subsumption QE
Subsumption QE is a limited form of QE, allowing the user to choose terms in a hierarchy and retrieve all documents categorised by these terms or by any terms under them. This is a popular technique, but it is predicated on the ability of a system to allow retrieval via hierarchical term selection in the first place. Dynamic Taxonomy and Flamenco provide good examples of this behaviour.

iv. Multidirection QE
Sebastiani extends simple QE with a broader definition of expansion (Sebastiani 2001). Their system allows a jump not only to terms related to the query terms, but also to words connected to parent or child terms, using relationships derived from a combination of thesaurus and corpus.

v. Query Refinement
Query refinement is the situation where we offer the user the option of replacing a query term with a related term. This has a more profound effect on search direction than query expansion, due to the removal of the original term. It is seen as more of a precision-enhancing move than the traditional recall enhancement focus of query expansion. An interactive example of this can be seen in (Tomita and Kikui 2001).


vi. Query categorisation
Query categorisation is a relatively rare trait whereby a system categorises the incoming queries, which then determines the presentation of results. Among the systems reviewed here, only Dynacat uses this feature.

vii. Result set summarisation
A powerful feedback mechanism involves showing the user a summarised picture of their current result set. This sums up a query’s effect, and provides direction for the next move, allowing the user to view the document set from an overview and gradually zoom in on the documents s/he is interested in. While this feature is implemented in Dynamic Taxonomy and Flamenco as well, CIQUEST and Shneiderman provide concise examples. CIQUEST, the Concept Based Interactive Query Expansion Support Tool (Beaulieu 2003), incorporates an initial query into a subsumption based browsing structure. It retrieves documents based on an initial query, and builds a concept tree based on the concepts in the result set and words related to those concepts, using relationships from Wordnet, a large general purpose English language thesaurus. Filtering is then allowed based on the derived concept tree. A highly graphical version of this was done by Shneiderman (Shneiderman, Feldman et al. 2000), representing the entire result set via a two dimensional visualisation, with the axes showing hierarchical and categorical metadata respectively.

viii. Location in hierarchy

Similarly, while browsing metadata, many systems find it useful to show the user the location of their current position in a subsumption hierarchy, inhibiting user disorientation. Most of the systems that possess a hierarchy implement this feature: Dynamic Taxonomy, Flamenco, Cat-a-cone. ix. Query preview Query preview was developed for use in distributed retrieval systems, where bandwidth costs were such that it was impractical to retrieve the entire result set. Instead, result set features such as the numbers of documents and document metadata were retrieved.. Interactive QE systems such as Dynamic Taxonomy and Flamenco use this feature for an alternate purpose, to provide sufficient information scent while not overwhelming the user with too much detail. B. Interactive QE Example Systems This section delves into some example interactive QE systems, describing each more fully.


i. Dynamic Taxonomy

Dynamic Taxonomy (DT) is an innovative combination of query expansion and interactive result set modification. It has also been used for product feature demonstration (Sacco 2003), newspaper article retrieval, as a front end to a database of renaissance era paintings (Sacco 2000; Sacco 2002), and in a medical guidelines browser (Wollersheim and Rahayu 2001). It was designed to use the structure found in a simple ontology to enhance user IR. In summary, DT provides a browsing framework, and then provides classification and filtering tools which organise the browse process. DT, while capable of incorporating other modalities, is fundamentally browse-based. Users navigate a metadata-based hierarchy, choosing terms, and obtaining document sets that are categorised under these terms, or terms subsumed by these terms. By the inclusion of these subcategories, DT performs simple, strictly hierarchical expansion of chosen query terms. The resulting oversupply of documents from such a query expansion is handled by query preview of search results, and via the zoom operator, a dynamic filtering of the result set. A problem with a metadata browser is that of index term overload. DT reduces the complexity of the index in two ways. Firstly, it displays a hierarchical view of the index term set, allowing complexity to be hidden under higher-level terms. Secondly, it provides a zoom operator, which dynamically reduces the term set. The zoom operation reduces the index term set by filtering it against the document base. In response to a request to zoom on an index phrase, the system prunes the taxonomy, retaining only the taxonomic elements that categorise the set of text atoms that are also categorised by the zoom phrase. More importantly, any terms in the taxonomy which do not, either directly or indirectly, classify the remaining text atoms are pruned. This effectively filters the term set against the set of documents that contain the term (or subsumptions thereof). Zooming on a term lays down a semantic anchor, around which further operations pivot. After a zoom, other retrieval operations can continue on the reduced taxonomy, and successive zooms add further constraints. A zoom example is shown in Figure 2-3 and Figure 2-4. Figure 2-3 shows a sample taxonomy derived from the medical field. Figure 2-4 shows the pruned taxonomy after a zoom on the concept amoxicillin. If a text atom is not classified by amoxicillin, it is not included in the pruned taxonomy. In summary, DT uses a directed acyclic graph (DAG) of subsumption links as an index structure to summarise a document set. It is a framework that allows users to browse topics in a document base, drilling down through an increasing level of taxonomic detail. The zoom operation dynamically prunes the browse tree.
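To make these operations concrete, the following minimal sketch (written for this discussion; the TaxonomySketch class, its maps and its method names are all hypothetical, and this is not DT's implementation) expands a chosen term to its subsumed descendants, retrieves the text atoms classified under them, and prunes the taxonomy to the terms that still classify something in the zoomed set:

    import java.util.*;

    // Illustrative sketch of subsumption expansion and the DT zoom operator.
    class TaxonomySketch {
        final Map<String, List<String>> children = new HashMap<>(); // term -> direct subterms
        final Map<String, Set<String>> atomsOf = new HashMap<>();   // term -> directly classified atoms
        final Set<String> terms = new HashSet<>();                  // all taxonomy terms

        // Subsumption QE: a chosen term expands to itself plus all terms beneath it.
        Set<String> subsumed(String term) {
            Set<String> seen = new HashSet<>();
            Deque<String> stack = new ArrayDeque<>(List.of(term));
            while (!stack.isEmpty()) {
                String t = stack.pop();
                if (seen.add(t))
                    stack.addAll(children.getOrDefault(t, List.of()));
            }
            return seen;
        }

        // Text atoms retrieved for a term under subsumption expansion.
        Set<String> retrieve(String term) {
            Set<String> atoms = new HashSet<>();
            for (String t : subsumed(term))
                atoms.addAll(atomsOf.getOrDefault(t, Set.of()));
            return atoms;
        }

        // Zoom: keep only terms that still classify, directly or through a
        // descendant, at least one atom classified by the zoom term.
        Set<String> zoom(String zoomTerm) {
            Set<String> remaining = retrieve(zoomTerm);
            Set<String> kept = new HashSet<>();
            for (String t : terms)
                if (!Collections.disjoint(retrieve(t), remaining))
                    kept.add(t);
            return kept;
        }
    }

On the data of Figure 2-3, a zoom on amoxycillin keeps procaine penicillin only if some remaining atom is also classified under it, which is the behaviour shown in Figure 2-4.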


DT is a prototypical attempt at solving both of the problems associated with imprecise queries: too few results, and too many. Its simplicity is a virtue, in that it can fit into different exploration methods, because the zoom and summarisation tools can be used on any arbitrary corpus subset generated by other tools. While DT is both simple and powerful, its full strength has not been explored. While it can theoretically be used in conjunction with other modalities, this has not yet been demonstrated. The use of hand-crafted taxonomies attracts the same drawbacks as those found with other knowledge retrieval solutions, namely the ossification of the summary data. DT's QE is strictly subsumption based; the use of other expansion types could potentiate DT's usefulness. A final criticism of DT is its rigidity: its use of Boolean logic to accept or reject documents means that the criticisms of Boolean retrieval apply here as well.

[Figure 2-3 appears here: a medical concept taxonomy with top-level terms Pathogens, Drugs and Conditions; node labels include amoxycillin, procaine penicillin, vancomycin, Streptococcus pneumoniae, Enterococcus faecalis, Conjunctivitis and Meningitis, with text atoms Acute cholecystitis, Hospital acquired meningitis, Gonococcal conjunctivitis, and Meningitis : Haemophilus influenzae type b classified beneath them.]

Figure 2-3 Example taxonomy, and subsequently classified text atoms. Solid lines show taxonomic is-a links; dotted lines denote classification by a taxonomic term.


[Figure 2-4 appears here: the pruned taxonomy, retaining Pathogens, Drugs (amoxycillin, procaine penicillin), Conditions (Conjunctivitis), Enterococcus faecalis, and the text atoms Acute cholecystitis and Gonococcal conjunctivitis.]

Figure 2-4 Remaining taxonomy after a zoom on the concept amoxicillin. Atoms that are not classified by amoxicillin are pruned; also pruned are the sections of the taxonomic tree that would be unused.

ii. Cat-a-Cone
Early in interactive QE history, Hearst developed the Cat-a-Cone system (Hearst and Karadi 1997), which incorporated a 3D display paradigm for specifying searches and navigating large category hierarchies. It offered simultaneous display of categories and retrieved documents. Cat-a-Cone combines many simple ideas to enable a very flexible system. Some of the design decisions include:
• having many modalities of interaction;
• allowing conjunctions of disjunctions;
• a quorum ranking retrieval strategy (a generic sketch of quorum scoring follows this list);
• selection of multiple categories, along with their hierarchical context;
• the ability to limit the category tree view to categories that are in the current retrieval set; and
• the ability to combine category choice and free text search.
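Of these decisions, the quorum ranking strategy is the most self-contained: a document scores by how many distinct query terms it matches, so documents matching all terms rank above those matching only some. The sketch below is a generic rendering of that idea (all names hypothetical), not Cat-a-Cone's code:

    import java.util.*;

    // Generic quorum (coordination-level) ranking: score a document by the
    // number of distinct query terms it contains.
    class QuorumRankSketch {
        static int score(Set<String> docTerms, Set<String> queryTerms) {
            int hits = 0;
            for (String q : queryTerms)
                if (docTerms.contains(q)) hits++;
            return hits;
        }

        public static void main(String[] args) {
            Set<String> query = Set.of("meningitis", "amoxycillin", "infant");
            Map<String, Set<String>> docs = Map.of(
                "d1", Set.of("meningitis", "amoxycillin", "infant", "dose"),
                "d2", Set.of("meningitis", "infant"),
                "d3", Set.of("amoxycillin"));
            docs.entrySet().stream()
                .sorted((a, b) -> score(b.getValue(), query) - score(a.getValue(), query))
                .forEach(e -> System.out.println(e.getKey() + " score=" + score(e.getValue(), query)));
            // prints d1 score=3, then d2 score=2, then d3 score=1
        }
    }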

iii. Flamenco
Later work by Hearst on a project called Flamenco (Hearst, Elliott et al. 2002) shows continued innovation. Flamenco tries to model search flow, with a focus on results, not on the search process per se. Guidelines that drive the project include:
• Hyperlinks should be integrated into the search results, because hyperlinks outperform search on most websites.
• Clustering is not as useful as categorisation under categories that have predictable, understandable meaning.
• An explicit exposure of hierarchical faceted metadata, allowing refinement and expansion of the current query, and guiding users to possible choices.
• Incorporation of query preview, showing the number of records available for each metadata attribute, providing a summary of the data and in particular of the current query (a counting sketch follows this list).
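The query preview guideline reduces to computing, for each metadata value, the number of documents in the current result set that carry it. The counting sketch below (hypothetical names, not Flamenco's code) produces the per-value counts such an interface would display:

    import java.util.*;

    // Query preview sketch: count, per metadata value, the documents in the
    // current result set, so counts can be shown before documents are fetched.
    class QueryPreviewSketch {
        static Map<String, Integer> previewCounts(Collection<Set<String>> resultMetadata) {
            Map<String, Integer> counts = new HashMap<>();
            for (Set<String> docValues : resultMetadata)
                for (String v : docValues)
                    counts.merge(v, 1, Integer::sum);
            return counts;
        }

        public static void main(String[] args) {
            List<Set<String>> results = List.of(
                Set.of("Drugs/amoxycillin", "Conditions/Meningitis"),
                Set.of("Drugs/amoxycillin", "Conditions/Conjunctivitis"),
                Set.of("Drugs/vancomycin"));
            System.out.println(previewCounts(results));
            // e.g. {Drugs/amoxycillin=2, Conditions/Meningitis=1, ...} (order may vary)
        }
    }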

iv. Dynacat
Our final exploration of interactive QE covers a tool called Dynacat (Pratt 1997; Pratt 1999; Pratt and Fagan 2000). It too combines query with structured browsing, but in the medical domain. In operation, it performs dynamic categorisation of search results based on the query type, the retrieved document set, and a taxonomic model of the domain. It first fits queries into a query model, and then dynamically generates a hierarchy based on both the identified query type and the documents retrieved. Upon evaluation, Dynacat users found significantly (P…


[Figure 5-5 appears here: an expansion graph of 20 nodes (A to T), with concept names including Chronic inflammatory polyneuropathy; inflammatory polyneuropathy, unspecified; Polyneuropathy; polyneuropathy in disease NOS; peripheral nervous system diseases; Nervous system diseases; peripheral neuropathy; neuromuscular diseases; Myopathy; Neurologic Disease; Physical disorders; Epilepsy, absence; and Nervous system illness, NOS.]

Figure 5-5 Probability based expansion starting from chronic inflammatory polyneuropathy. Circled expansions exist in the relevant document set, while expansions in rectangles do not exist in the relevant set. Dotted links represent expansion paths excluded due to the chosen method's criteria.

A basic flaw in the simple probability method mentioned above is that no probability is subtracted for traversing a node that has only a single outgoing link. This can be fixed through a simple modification, which we call subtractive probability; it is laid out so that every link consumes semantic weight. The formula we use is:

$$ p(n_i) = \frac{p(n_{i-1})}{|n_{i-1}|} - c \cdot \frac{p(n_{i-1})}{|n_{i-1}|} $$

where p(n) is the semantic weight of node n, i is the distance from the source node, and |n| is the number of edges leaving node n. The damping factor, c, ranges from 0 to 1, and denotes the proportion of the parent node's semantic weight that is lost at every step away from the source node. In and of itself, this modification does not generate expansions in a different order from the simple probability method. The reason is that it, like the simple probability method, is fair, in that all children of a node have the same probability. This means that a basic subtraction method generates expansions in exactly the same order as the simple probability method, because they both treat all outgoing edges from a node the same. With subtraction, the final probability is lower, due to the incorporation of the damping factor, but the order in which the expansions are generated is the same. Subtractive probability only shows a difference when incorporated into techniques that generate different scores for each edge leaving a node, where the damping subtraction can make a difference to output order. Therefore, we examine the subtract modification via incorporation into the average probability method. Table 5-3 shows the first 20 expansions of a subtract average probability expansion starting with the concept Hypoaldosteronism. The graph for this expansion is similar to Figure 5-4, because this expansion also chooses only those concepts adjacent to the source concept. This is because there are numerous adjacent concepts, arising from the lack of other expansion restrictions, and so expansion does not progress past the first level. This method differs from standard depth-limited expansion because it ranks the choices.

Table 5-3 Concepts generated by the subtract average probability method. The starting query concept was Hypoaldosteronism.

Connected Concept | Relationship | Is Relevant?
Acquired Immunodeficiency Syndrome | RO | Y
Addison's Disease | SIB | Y
Adrenal Gland Neoplasms | SIB | Y
Adrenal hypertrophy or hyperplasia | SIB | Y
Aldosterone | RO | Y
Cushing Syndrome | SIB | Y
Hyperaldosteronism | SIB | Y
Hyperreninemic hypoaldosteronism | CHD | Y
Acidosis, Renal Tubular | RO | N
Adrenal Gland Diseases | PAR;RB | N
Adrenal Gland Hyperfunction | SIB | N
Adrenal gland hypofunction | PAR | N
Adrenal insufficiency due to adrenal metastasis | SIB | N
Adrenoleukodystrophy | SIB | N
Congenital hypoplasia of adrenal gland | SIB | N
Iatrogenic adrenal insufficiency | SIB | N
Post-adrenalectomy adrenal insufficiency | SIB | N
Type I Renal Tubular Acidosis | RO | N
Virilism | SIB | N

In summary, the simplicity of the probability methods (they are O(n), where n is the number of output nodes) makes them ideal first candidates in multistage expansions. They can reduce the size of the search space, and make the task of the more sophisticated methods manageable.
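As a concrete rendering of this family, the sketch below (hypothetical names; a simplified stand-in for the filter listings reproduced in the appendix) performs a best-first expansion in which each child receives the parent's weight divided by the parent's out-degree, damped by a factor c: c = 0 recovers the simple probability method, and c > 0 gives the subtractive variant.

    import java.util.*;

    // Sketch of simple and subtractive probability expansion: a child's weight
    // is the parent's weight split across the parent's outgoing links, with the
    // proportion c of that share removed at every hop (c = 0: simple method).
    class ProbabilityExpansionSketch {
        static Map<String, Double> expand(Map<String, List<String>> graph,
                                          String source, double minP, double c) {
            Map<String, Double> weight = new HashMap<>();
            PriorityQueue<String> queue = new PriorityQueue<>(
                (a, b) -> Double.compare(weight.get(b), weight.get(a)));
            weight.put(source, 1.0);
            queue.add(source);
            while (!queue.isEmpty()) {
                String node = queue.poll();
                List<String> out = graph.getOrDefault(node, List.of());
                if (out.isEmpty()) continue;
                double childP = (weight.get(node) / out.size()) * (1.0 - c);
                for (String child : out)
                    if (childP > minP && !weight.containsKey(child)) {
                        weight.put(child, childP); // best-first: first weight kept
                        queue.add(child);
                    }
            }
            return weight;
        }
    }

Because the traversal is best-first, the first weight assigned to a node is its largest, which matches the 'take the largest weight' convention discussed later for nodes reachable by multiple paths.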


C. Voting Methods
It is the nature of relevance that there is no absolute right and wrong. What is a relevant document for one person's query might be an irrelevant document for the same query from another person; so too with ontologies, where there are many possible correct relationships between concepts, depending on information need. For example, the relationship between the concepts Hysteria (UMLS# C0020701) and Dissociative Disorders (UMLS# C0012746) is classified as no less than six separate types: child, parent, relation broader, relation narrower, relation other, and sibling. While this implies that there is some relationship of probably high semantic value between the concepts, the exact nature of that relationship is uncertain. While this variety can be a source of frustration, it can also serve as a source of usable meaning. A set of variously labelled links between two concepts merely has a different meaning than a single link, or even a set of links with the same label. We propose to use this variety for query expansion. The idea is to conceptualise each link as a vote for a semantic relationship between terms. This extends the redundancy exploration work done by Bodenreider (2003). Using this idea, we evaluate several alternative experiments which use link degree within the basic probability query expansion framework. We call the first such method simple voting. It treats each link as having equal value. If there are 4 links between A and B, and 1 link between A and C, then given p(A)=1, p(B)=4/5 and p(C)=1/5. In simple voting, and for that matter all the methods considered until now, we have been democratic: all relationships are created equal. With some relationship sources, such as co-occurrence data, this is all that is possible; we have little other relationship information. But most ontological sources are richer, containing much ancillary relationship data. Even within the limited UMLS metathesaurus framework, there are nine major categories of relationship. In the second voting technique, semantic voting, we use this wealth of relationship information to weight our votes according to how much semantic information they transfer. The problem with this is that, while we can surmise that the different relationships correspond to differing semantic distances, we still need to quantify those distances. We bootstrap this technique using values calculated with a previous methodology, evaluating each relationship type on its own, and seeing how well it individually generates expanded query concepts. We first generate expansions, drawing exclusively from each of the nine UMLS relationship types. The method uses the simple probability technique with a minimum probability of 0.001, limited to the first 30 expansions. The expansions are then evaluated as to how likely they are to find a concept that exists in a relevant document, and the results discovered are used to calculate semantic weighting scores for each of the nine relationship types; each score is the normalised average precision that the relationship achieves when performing expansions on its own. The precision ratios generated can be seen in Table 5-4. Semantic voting, shown in Figure 5-6, makes use of this calculated semantic weighting, making the vote of a link equal to a semantic weight determined by the relationship type. It is this formulation that allows us to use UMLS relationship type information. Semantic voting uses simple probability as its base model, but it could incorporate ideas from both the subtractive and average probability models. This work does not explore these dimensions.

Table 5-4 Weight precision of each of the major UMLS relationship types. Values shown as a proportion of the total precision, calculated over the first 30 concepts generated. For example, an AQ relationship is more than nine times less likely to find a relevant expansion than a CHD relationship.

Name | UMLS Code | Precision
Child | CHD | 18%
Like (Synonym) | RL | 18%
Broader | RB | 15%
Narrower | RN | 13%
Other | RO | 12%
Parent | PAR | 10%
Sibling | SIB | 10%
Accepts as Qualifier | AQ | 2%
Qualified By | QB | 2%

[Figure 5-6 appears here: node A (P=1) is linked to node B by one RB relation (W(RB)=0.4) and two PAR relations (W(PAR)=0.2 each), and to node C by one SIB relation (W(SIB)=0.2), giving P(B)=0.8 and P(C)=0.2.]

Figure 5-6 Weighted voting calculation. Given simplified weights of relationships as shown (RB=0.4, PAR=0.2, SIB=0.2), P(B) = P(A) * W(B)/W(total).
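The Figure 5-6 calculation can be written out directly. In the following sketch (hypothetical names; the weights are the simplified ones from the figure, not the Table 5-4 values), each link votes with the weight of its relationship type, and a neighbour's probability is its share of the total outgoing weight:

    import java.util.*;

    // Sketch of semantic voting: each link votes with the weight of its
    // relationship type; P(target) = P(source) * W(links to target) / W(all links).
    class SemanticVotingSketch {
        // simplified relationship weights, as in Figure 5-6
        static final Map<String, Double> W = Map.of("RB", 0.4, "PAR", 0.2, "SIB", 0.2);

        static class Link {
            final String target, relType;
            Link(String target, String relType) { this.target = target; this.relType = relType; }
        }

        static Map<String, Double> vote(List<Link> outgoing, double pSource) {
            double total = 0;
            Map<String, Double> weightTo = new HashMap<>();
            for (Link l : outgoing) {
                double w = W.getOrDefault(l.relType, 0.0);
                total += w;
                weightTo.merge(l.target, w, Double::sum);
            }
            Map<String, Double> p = new HashMap<>();
            for (Map.Entry<String, Double> e : weightTo.entrySet())
                p.put(e.getKey(), pSource * e.getValue() / total);
            return p;
        }

        public static void main(String[] args) {
            // the Figure 5-6 example: A -> B via one RB and two PAR links, A -> C via one SIB
            List<Link> out = List.of(new Link("B", "RB"), new Link("B", "PAR"),
                                     new Link("B", "PAR"), new Link("C", "SIB"));
            System.out.println(vote(out, 1.0)); // approximately {B=0.8, C=0.2}
        }
    }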

An example of a place where semantic voting shows promise follows. Table 5-5 shows the expansions generated by the voting method based on the concept angiotensin converting enzyme inhibitor (ACEI), from the Ohsumed query looking for an ACEI review article. Voting expansion serves this query well because voting expansions generate broadly connected concepts, 'common knowledge' so to speak, and common knowledge is exactly what is demanded by a review article. Again, because of the lack of other restrictions on the expansion, the concepts are all adjacent to the source concept, which fits well into the philosophy of 'well connectedness'. This produces a graph with a similar shape to Figure 5-2.

D. Directional Techniques
A directional technique is another expansion method that uses the semantic information contained in the ontology, but it too incorporates a heuristic component, consisting of outside knowledge that imposes a structure on the relationship types. It says that there are certain relationships that point upwards (PAR, RB), and others that point downwards (CHD, RN).

Table 5-5 Weighted voting method example. List of the first 20 concepts generated by the method starting from the UMLS concept Angiotensin-Converting Enzyme Inhibitors. Relationship Type is the type of relationship between ACEI and the connected concept, and Is Relevant? denotes whether the concept exists in the relevant document set. Relationship types are defined in Table 3-4.

Connected Concept | Relationship Type | Is Relevant?
Adrenergic beta-Antagonists | SIB | Y
Antihypertensive Agents | PAR, RB, RO | Y
Captopril | CHD, RN, RO | Y
Cilazapril | CHD, RN | Y
Enalapril | CHD, RN, RO | Y
Enalaprilat | RN, RO | Y
Lisinopril | CHD, RN, SIB | Y
Perindopril | CHD, RN | Y
Renin-Angiotensin System | RO | Y
Saralasin | RN, SIB | Y
Benazepril | CHD, RN | Y
Quinapril | CHD, RN | Y
Anti-Arrhythmia Agents | PAR, SIB | N
Fosinopril | CHD, RN | N
Protease Inhibitors | PAR, RB | N
Ramipril | CHD, RN | N
Tissue Inhibitor of Metalloproteinases | SIB | N
Moexipril | CHD, RN | N
Trandolapril | CHD, RN | N

These directions are then used by an expansion scheme called central tendency. The idea behind central tendency is that the depth of a concept in the ontology will influence its successful expansion strategy; specifically, we say that concepts are more likely to expand successfully towards the centre of the ontology. Where h is the depth function, the algorithm we use is:
• if h(n) > average(h(n)), expand upwards;
• if h(n) < average(h(n)), expand downwards.

In our experiments, we use the distance from the top of the tree as our depth measure.
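A direct rendering of the rule follows (hypothetical names; depth is measured as distance from the ontology root, as in our experiments):

    // Central tendency sketch: concepts deeper than the average depth expand
    // upwards (PAR, RB); shallower concepts expand downwards (CHD, RN).
    class CentralTendencySketch {
        static String[] expansionRelations(int depth, double averageDepth) {
            return (depth > averageDepth)
                ? new String[] {"PAR", "RB"}   // deeper than average: expand upwards
                : new String[] {"CHD", "RN"};  // shallower than average: expand downwards
        }
    }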


E. Semantic Propagation Techniques
The probability-based techniques above can be viewed as the simplest forms of semantic propagation techniques. Semantic propagation is the name for the class of expansion techniques that take the semantic weight of a term and propagate it along the pathways connecting that node to the rest of the ontology. The inspiration for this class of methods is Google's pagerank calculation (Brin and Page 1998). The idea behind pagerank is simple: the pagerank of a web page is distributed evenly among its outgoing links, and the pagerank of a page is calculated as the sum of the pageranks it receives from the links that point to it. In semantic propagation, we take this idea and apply it to the nodes in an ontology. In its simplest sense, the semantic content of a node is influenced by the semantic content of the nodes that point to it. In the case where there are two paths to a node, there are several alternatives: we can take the largest weight, or the sum of the weights. The former is in fact a conceptualisation equivalent to the basic probability depth measure, explained above.

The first semantic propagation technique we examine is based on the pagerank algorithm, due to its proven worth and unique usefulness. But there are several problems with this choice. Pagerank does not have the concept of a source node; the rank of the pages is determined by iterating through all the pages until the entire system arrives at an equilibrium. In our case, we have one source of semantic value in the entire graph, but the pagerank algorithm generates rank spontaneously. It is likely that the source node would not end up as the highest-ranking node in the graph. Pagerank is also inappropriate in our case because it uses hypertext links, which in a sense vote for the page they point to; in an ontology, the semantic meaning of a link flows both ways.

Even though pagerank is not a suitable vehicle for calculating semantic neighbourhood, it is a prototype for a class of expansion methods that have such functionality. One exemplar we call log multiple. The log multiple method derives from the observed fact that, often within the UMLS metathesaurus, where there are multiple paths between two nodes, those nodes are more highly related than the mere sum of their probabilities would imply. An example of this motivation, from Ohsumed, is the relationship between the concepts Adrenergic beta-Antagonists and Propanolamine, two classes of organic chemical. The former is a concept from query 19, while the latter is a concept from the documents said to be relevant to that query. While UMLS does not note any direct relationship between these two concepts, there are 160 two-level relationships between them. We surmise there is a small semantic distance between these two concepts, even though there is no distance=1 link. We want a technique that will reward a concept when there are multiple paths between it and a source concept.


We do this by calculating the probability of arriving at a node to be:

$$ p(n) = \sum_{1}^{n} P_{\mathrm{incoming}} \times \bigl( 1 + \log(\max(in - out,\, 0) \times w) \bigr) $$
where P_incoming is the probability of each of the incoming links, in and out are the numbers of incoming and outgoing links respectively, and w is a weighting factor that is the reward given to multiple links. We use the logarithm to reduce the reward for larger numbers of links, while the max(in − out, 0) term ensures that rewards flow only to relatively well source-connected nodes, and not to merely promiscuous ones. In our method, we do not repeatedly traverse all the nodes on a graph à la pagerank; instead, we propagate the semantic meaning outwards from the source node only. More formally, p(N1) directly affects p(N2) iff there is a link between N1 and N2, and d(N1) <

                        if (currRank > maxRank) goon = false;
                        if ((current.depth + 1 < maxDepth))
                            queue[++queueTail] = n.hashCode();
                    }
                }
                Utility.log(identifier + ":" + "DL SIze= " + queueCurrent
                    + ", q length= " + queueTail
                    + ", Depth = " + current.depth
                    + ", Key = " + current.key
                    + ", total = " + totalSeen
                    + ", misses = " + (totalSeen - hits)
                    + ", Hits = " + hits, 49);
                //}
                if (++queueCurrent > queueTail) {
                    goon = false;
                } else {
                    current = rv.getNode(queue[queueCurrent]);
                }
                if (currRank > maxRank) goon = false;
            }
            return rv;
        }
    }


/* *********************************************************************
 * Filename: qe.process.FilterLogMultipleSum.java
 ********************************************************************* */
/*
 * Created on Feb 12, 2004
 *
 * filters a graph according to LogMultiple filter
 * Log multiple transmits semantic information in outward direction from source node
 * weights according to number of links that come from the source node
 * LM 2 takes into account the difference between inward degree and outward degree
 * LMsum simply adds in the upstream probability
 * LogMultipleSum just adds in upstream. Does not reinforce
 *
 * Notes:
 * LogMultiple uses all link information, not merely the information used
 * to figure out the graph subset. For example, depth limit uses only outgoing
 * links, throws away all other links
 */
package process;

import java.util.Enumeration;
import java.util.Hashtable;
import java.util.Vector;

import qe.Edge;
import qe.Graph;
import qe.Node;
import utility.PrintfFormat;
import utility.Utility;

/**
 * @author Dennis Wollersheim
 *
 * Object that contains all the information necessary for a transformation of a graph
 */
public class FilterLogMultipleSum extends Filter {

    private static final String identifier = "logMultipleSum";

    public FilterLogMultipleSum(Hashtable parameters) {
        super(parameters);
        id = identifier;
    }

    public static void main(String args[]) throws Exception {
        /* Correct output
         * b Probability = 0.3333333333333333
         * a Probability = 1.0
         * e Probability = 0.3333333333333333
         * d Probability = 1.1287647870399635
         * c Probability = 0.3333333333333333
         */
        Utility.LOGLEVEL = 99;
        Graph g = new Graph();
        Node a = new Node("c1");
        Node b = new Node("c2");
        Node c = new Node("c3");
        Node d = new Node("c4");
        Node e = new Node("c5");
        g.add(a);
        g.add(b);
        g.add(c);
        g.add(d);
        g.add(e);
        g.add(new Edge(a, b));
        g.add(new Edge(a, c));
        g.add(new Edge(d, b));
        g.add(new Edge(c, d));
        g.add(new Edge(a, e));
        Hashtable h = new Hashtable();
        h.put("sourceNode", "c1");
        Filter f = new FilterLogMultipleSum(h);
        f.doFilter(g);
        g.setDisplayFileName("/home/dewoller/try.gdl");
        g.displayGraph();
        g.printNodes();
    }

    /* Algorithm:
     * Take all the nodes on the current graph
     * repeatedly cycle through the nodes, assigning depths
     * until we have assigned them all a depth
     *
     * create a depth ordered list of nodes
     *
     * repeatedly traverse the nodes in depth order,
     * propagating probability outwards from source node
     *
     * @see qe.Filter#doFilter(qe.Graph)
     */
    public Graph doFilter(Graph inputGraph) {
        // filters in place, input graph modified
        Utility.log(identifier + ":" + "In doFilter", 40);
        inputGraph.clearNodeDepth();
        inputGraph.clearRelationshipCache();
        Utility.log(identifier + ":" + "Cleared Cached Values", 40);
        int numNodes = inputGraph.getNodes().size();
        if (numNodes == 0)
            return inputGraph;
        Node traverseOrder[] = new Node[numNodes];
        Utility.log(identifier + ":" + "input size = " + numNodes, 40);
        int count = 1;
        int current = 0;
        traverseOrder[0] = inputGraph.getOrAddNode(inputGraph.getSourceNode());
        traverseOrder[0].setDepth(0);
        int nPasses = getIntProperty("nPasses", 10);
        // initialise values
        while (count < numNodes) {
            // find all the nodes connected to the current node that do not have a depth
            Vector v = inputGraph.getNeighbours(traverseOrder[current]);
            Utility.log(identifier + ":" + "Traversing Neighbours of "
                + traverseOrder[current].key + ", size = " + v.size(), 50);
            for (Enumeration e = v.elements(); e.hasMoreElements();) {
                Node n = (Node) e.nextElement();
                // if they don't already have a depth
                if (n.depth == -1) {
                    // assign them a depth
                    n.setDepth(traverseOrder[current].depth + 1);
                    Utility.log(identifier + ": " + "MakeTree" + n.key
                        + " d= " + (traverseOrder[current].depth + 1), 50);
                    // add them to the traverseOrder array
                    traverseOrder[count++] = n;
                }
            }
            // get next node to traverse
            current++;
            assert count > current;
        }
        Utility.log(identifier + ":" + "Finished generating tree ", 40);
        // set the probability of the start node
        traverseOrder[0].setProbability(1);
        for (int passCount = 0; passCount < nPasses; passCount++) {
            Utility.log(identifier + ":" + "Pass: " + passCount, 51);
            for (int currentNode = 0; currentNode < numNodes; currentNode++) {
                // calculate the probability of this node,
                // from how many links a) have probability, and b) end at this node,
                // and c) that we have probability info for yet
                Node n = traverseOrder[currentNode];
                double totalProbability = 0;
                // we don't want to fiddle with the probability
                // that emanates from the base node
                if (currentNode > 0) {
                    for (Enumeration e = inputGraph.getAncestors(n).elements();
                            e.hasMoreElements();) {
                        // for each ancestor of the current node, find out how much
                        // probability it will contribute to this node:
                        // sum the downstream probability of every node that points
                        // to the current node
                        Node upstreamNeighbour = (Node) e.nextElement();
                        totalProbability +=
                            inputGraph.getDownstreamProbability(upstreamNeighbour);
                    }
                    assert n.neighbourDegree > 0 :
                        "neighbourDegree not greater than 0, = " + n.neighbourDegree
                        + " key=" + n.key + " source=" + ((Node) traverseOrder[0]).key;
                    n.setProbability(totalProbability);
                    Utility.log(identifier + ":" + "Node: " + n.key
                        + " depth: " + n.depth
                        + " probability: "
                        + new PrintfFormat("%1.7e").sprintf(n.getProbability())
                        + " UpstreamN: " + inputGraph.getAncestors(n).size()
                        + " TotalN: " + n.neighbourDegree, 50);
                }
            }
        }
        if (hasProperty("minProbability")) {
            inputGraph.prune(getDoubleProperty("minProbability"));
        }
        if (hasProperty("maxRank")) {
            inputGraph = FilterTopN.prune(inputGraph, getIntProperty("maxRank"),
                this.properties);
        }
        return inputGraph;
    }
    // end of class
}


/* *********************************************************************
 * Filename: qe.process.FilterProbabilityLimit.java
 ********************************************************************* */
/*
 * Created on Feb 12, 2004
 *
 * filters a graph according to Probability filter
 */
package process;

import java.util.Enumeration;
import java.util.Hashtable;

import qe.Edge;
import qe.Graph;
import qe.Node;
import qe.SortedNodeList;
import utility.PrintfFormat;
import utility.Utility;

/**
 * @author Dennis Wollersheim
 *
 * Object that contains all the information necessary for a transformation of a graph
 */
public class FilterProbabilityLimit extends Filter {

    private static final String identifier = "probabilityLimit";

    public FilterProbabilityLimit(Hashtable parameters) {
        super(parameters);
        id = identifier;
    }

    public static void main(String args[]) throws Exception {
        Utility.LOGLEVEL = 99;
        Filter.test(args, identifier);
    }

    /* Algorithm:
     * get the source node
     * expand outwards from source node
     * keep track of nodes traversed so that we don't get stuck in a loop
     * @see qe.Filter#doFilter(qe.Graph)
     */
    public Graph doFilter(Graph inputGraph) {
        Hashtable traversed = new Hashtable();
        SortedNodeList toTraverse = new SortedNodeList();
        double minProbability = getDoubleProperty("minProbability");
        Graph rv = new Graph(this.properties, inputGraph);
        Node current = new Node(inputGraph.getSourceNode(), 1.0);
        current.setDepth(0);
        rv.add(current);
        traversed.put(current.key, current.key);
        int maxRank = getIntProperty("maxRank", 100000);
        int currRank = 1;
        boolean goon = true;
        // don't expand any nodes that are less probable than maxProbability
        while ((current.probability > minProbability) && (goon)) {
            for (Enumeration e = inputGraph.getNeighbours(current).elements();
                    e.hasMoreElements();) {
                Node n = (Node) e.nextElement();
                String nextKey = n.key;
                // if we have not traversed this node before
                if (traversed.get(nextKey) == null) {
                    // don't add to graph if too improbable
                    double probability =
                        current.probability / inputGraph.neighbourCount(current);
                    if ((probability > minProbability) && (++currRank
                            // …
                            0) && (rv.getNumNodes() > maxRank)) {
                        goon = false;
                    }
            }
            return rv;
        }
    }


/* *********************************************************************
 * Filename: qe.process.FilterProbabilitySubtractAverage.java
 ********************************************************************* */
/*
 * Created on Feb 12, 2004
 *
 * filters a graph according to Probability subtract average filter
 */
package process;

import java.util.Enumeration;
import java.util.Hashtable;

import qe.Edge;
import qe.Graph;
import qe.Node;
import qe.SortedNodeList;
import utility.PrintfFormat;
import utility.Utility;

/**
 * @author Dennis Wollersheim
 */
public class FilterProbabilitySubtractAverage extends Filter {

    private static final String identifier = "probabilitySubtractAverage";

    public FilterProbabilitySubtractAverage(Hashtable parameters) {
        super(parameters);
        id = identifier;
    }

    public static void main(String args[]) throws Exception {
        Utility.LOGLEVEL = 99;
        Filter.test(args, identifier);
    }

    /* Algorithm:
     * get the source node
     * expand outwards from source node
     * keep track of nodes traversed so that we don't get stuck in a loop
     *
     * @see qe.Filter#doFilter(qe.Graph)
     */
    public Graph doFilter(Graph inputGraph) {
        Hashtable traversed = new Hashtable();
        SortedNodeList toTraverse = new SortedNodeList();
        double minProbability = getDoubleProperty("minProbability");
        double dampingFactor = getDoubleProperty("dampingFactor");
        int maxRank = getIntProperty("maxRank", 100000);
        int currRank = 0;
        Graph rv = new Graph(this.properties, inputGraph);
        Node current = new Node(inputGraph.getSourceNode(), 1.0);
        current.setProbability(1.0 / inputGraph.neighbourCount(current));
        current.setDepth(0);
        toTraverse.add(current);
        traversed.put(current.key, current.key);

        // don't expand any nodes that are less probable than maxProbability
        while ((current.probability > minProbability) && (!toTraverse.isEmpty())
                && (rv.getNumNodes()
                // …
                minProbability)) {
                    Utility.log(identifier + " ++:" + nextKey
                        + " N= " + toTraverse.size()
                        + " Pc = " + new PrintfFormat("%0.6f").sprintf(current.probability)
                        + " Pn = " + new PrintfFormat("%0.6f").sprintf(probability)
                        + " Weight = " + new PrintfFormat("%0.5f").sprintf(fullNeighbour.getTotWeight())
                        + " Deg = " + new PrintfFormat("%7i").sprintf(inputGraph.fullNeighbourCount(current)),
                        49);
                    n.setProbability(probability);
                    n.setDepth(current.depth + 1);
                    n.setLastNode(current);
                    toTraverse.add(n);
                }
                traversed.put(nextKey, new Object());
            }
        }
        Utility.log(identifier + ":" + "q Length= " + toTraverse.size()
            + " Prob = " + new PrintfFormat("%1.7f").sprintf(current.probability)
            + ", Key = " + current.key, 49);
        }
        return rv;
    }
}


Appendix 4 XML Parameter File for Query Expansion Framework
