Progress on Human-Computer Interaction in the GENIA Project on the Internet
Nigel Collier, Hyun Seok Park and Jun-ichi Tsujii
Department of Information Science, Graduate School of Science, University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113, Japan
E-mail: {nigel, hsp20, [email protected]}

June 1, 1999
Abstract
The GENIA project is aimed at extracting information from medical research papers and is applied to abstracts in the genome domain. A key aim of the project is to help users to access information, and we have decided to integrate the language processing software into the Internet. This integration requires some thought if it is to be useful, and we have cooperated closely with groups of biologists to maximise functionality. This paper describes our early experience in designing the human interface and maximising the effectiveness of user interaction with the system.
Introduction
The GENIA project is aimed at extracting information on-the-fly from medical journals and abstracts. Since a key aim of the project is to help users to access information, we have decided to integrate the resulting tool into the Internet. This integration requires some thought if it is to be useful, and we have cooperated closely with groups of biologists to maximise functionality. This paper describes our experience in designing the human interface, and provides an overview of the system itself, explained using screen shots. The application of language engineering to the Internet and the World Wide Web [1] is a natural progression of the technology for at least two reasons. Firstly, there is the word-based nature of web pages, email and other online information; secondly, there is a need for information retrieval and extraction tools to help overcome the problems caused by information overload. Although the exponential growth in the number of WWW servers is likely to slow in coming years, the growth in the amount of online information is likely to increase as users, encouraged by this popularity, extend access to their own information by migrating it from paper-based to electronic-based delivery systems.
An example of this migration within science and technology is the online availability of journal papers [12] and other sources of primary information. We have chosen to work in the domain of biology research papers and have found that this research field already makes wide use of the Web. These days it is not only convenient to have access to primary information in online databases, but it is becoming essential as the amount of scientific literature increases. An increase in the quantity of data, its interconnectivity and its speed of delivery does not, unfortunately, guarantee that the user has an improved level of information access, which can be considered as the timely ability to retrieve and understand useful data. This is because much of the data will be noise, of little or no interest to the user in meeting his/her goal. For this reason it is necessary to improve language engineering technologies such as information retrieval, filtering and extraction, as well as machine translation (so-called cross-language information access [6]). Below we outline the major components of the GENIA [3] system we are developing, and then discuss how we can meet the user's goals with reference to the user interface which links all the components.
Project Overview
In the context of the global research effort to map the human genome, the GENIA project aims to support such research by automatically extracting useful information from biochemical papers and their abstracts written by domain specialists. GENIA seeks to extract this information automatically from MEDLINE [8], a large repository of publicly available research texts written by domain experts. The key elements of the project can be seen in Figure 1 and a sample PC screen image in Figure 2. The system modules are described below.

Named entity and template element task

Through discussions with domain experts, we have identified several open classes of named entities such as the names of proteins (including transcription factors) and genes. The reliable identification and acquisition of such class members, as well as the values of their attributes as they appear in free text, is one of our key goals, so that terminology databases can be extended automatically. This is also the basis for information extraction tasks such as scenario template extraction of events such as cell signalling. We should not, however, underestimate the difficulty of this task, as the naming conventions in this field are very loose. Because of the difficulty caused by inconsistent naming of terms, we have decided to use multiple sources of evidence for classifying terminology. In our initial experiments we used the EngCG shallow parser [14] to identify noun phrases and classify them as proteins [13] according to their co-occurrence with a set of verbs. In the next phase we will use statistical models trained on pre-classified database lists of terms.
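The verb co-occurrence heuristic used in our initial experiments can be sketched roughly as follows. The cue verbs, the window size and the decision threshold here are all illustrative assumptions made for this sketch; the project's actual verb set is drawn from MEDLINE data and is not reproduced here.

```python
from collections import Counter

# Illustrative cue verbs -- an assumption for this sketch, not the
# project's actual verb set.
CUE_VERBS = {"activates", "binds", "phosphorylates", "inhibits", "expresses"}

def score_candidates(tagged_sentences, window=4):
    """tagged_sentences: list of (tokens, candidate_indices) pairs, where
    candidate_indices marks tokens proposed by a shallow parser as heads
    of candidate term phrases. Counts, per candidate term, the sentences
    in which a cue verb occurs within `window` tokens of it."""
    scores = Counter()
    for tokens, candidates in tagged_sentences:
        verb_pos = [i for i, t in enumerate(tokens) if t.lower() in CUE_VERBS]
        for c in candidates:
            if any(abs(c - v) <= window for v in verb_pos):
                scores[tokens[c]] += 1
    return scores

def classify(scores, threshold=1):
    """Label a candidate as a protein if it co-occurred with cue verbs at
    least `threshold` times (an illustrative decision rule). Terms that
    never co-occur receive no score entry and so remain unclassified."""
    return {term: count >= threshold for term, count in scores.items()}
```

In practice a trained statistical model would replace the fixed threshold, but the evidence gathered is the same: proximity between candidate noun phrases and a set of domain verbs.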
Figure 1: System Schema

Scenario template task

Information extraction methods will be used to automatically extract the values for the attributes of the named entities found with the above methods. We then plan to extract information about the relations between entities in a so-called scenario template extraction task. Such information extraction can take place on either MEDLINE abstracts or the more challenging, but potentially richer, full papers. One of the key parts of this work is the construction and maintenance of an ontology for the domain, carried out by a system we are now developing called the Ontology Extraction-Maintenance System (OEMS). OEMS extracts three types of information about the domain ontology [10, 9], called typing information, from the abstracts: taxonomy (a subtype structure), mereology (a part-whole structure) and synonymy (an identity structure).
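The three kinds of typing information extracted by OEMS can be held in a simple relational structure such as the sketch below; the class design and the relation instances in the usage are invented for illustration and are not the system's actual representation.

```python
class DomainOntology:
    """Minimal sketch of the three typing-information structures:
    taxonomy (subtype), mereology (part-whole) and synonymy (identity)."""

    def __init__(self):
        self.taxonomy = set()   # pairs (subtype, supertype)
        self.mereology = set()  # pairs (part, whole)
        self.synonymy = set()   # unordered pairs of identical terms

    def add_isa(self, sub, sup):
        self.taxonomy.add((sub, sup))

    def add_part_of(self, part, whole):
        self.mereology.add((part, whole))

    def add_synonym(self, a, b):
        # stored symmetrically, since synonymy is an identity structure
        self.synonymy.add(frozenset((a, b)))

    def supertypes(self, term):
        """All supertypes reachable through the subtype structure
        (transitive closure over the taxonomy)."""
        found, frontier = set(), {term}
        while frontier:
            nxt = {sup for (sub, sup) in self.taxonomy if sub in frontier}
            nxt -= found
            found |= nxt
            frontier = nxt
        return found
```

For example, after adding the (hypothetical) facts that NF-kB is a transcription factor and that transcription factors are proteins, `supertypes("NF-kB")` would include `"protein"`.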
Thesaurus building

Currently we are working on automatically constructing a thesaurus from terminology in MEDLINE abstracts and domain dictionaries, for the purpose of query expansion in information retrieval from databases such as MEDLINE, e.g. see [5]. Choosing a set of features to represent proteins is an important factor. In our initial approach we have decided to use as the indexing terms the noun phrases (identified with the EngCG parser) which co-occur with the protein name inside a MEDLINE abstract. We are now testing the thesaurus on a large judgement set [4].

Figure 2: System screen shot of GENIA used on a PC
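One simple way to realise such a thesaurus is sketched below, under the assumption of plain co-occurrence counts and cosine similarity; the project's actual term weighting and similarity measure are not specified here, so these are illustrative choices.

```python
import math
from collections import Counter, defaultdict

def build_vectors(abstracts):
    """abstracts: list of (protein_name, noun_phrases) pairs, one per
    abstract. Each protein is represented by the noun phrases that
    co-occur with it, accumulated over all abstracts."""
    vectors = defaultdict(Counter)
    for protein, noun_phrases in abstracts:
        vectors[protein].update(noun_phrases)
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def expand_query(term, vectors, k=3):
    """Return up to k terms most similar to `term` -- candidates for
    query expansion against a database such as MEDLINE."""
    sims = [(other, cosine(vectors[term], vec))
            for other, vec in vectors.items() if other != term]
    return [t for t, s in sorted(sims, key=lambda p: -p[1])[:k] if s > 0]
```

A query for one protein name can then be expanded with the names of proteins whose co-occurring noun phrases overlap, which is the basic effect a thesaurus of this kind aims for.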
Information Access Objectives
Users in many domains require timely and accurate information. This is particularly true in the medical domain, where researchers and practitioners are trying to find cures for diseases. The genome research community in particular was an early adopter of Web technology, finding it useful for disseminating large amounts of data, including primary data on genome maps and DNA and protein sequences. Scientific research in the biology domain can be enhanced through information extraction together with a well-considered user interface and Web navigation facilities. The quality of information which we provide to users is, however, not solely dependent on the information extraction technology described above, but also on the way in which we present it, i.e. the interface and its usability. A key aspect of this usability is the availability of the tool and how it can be integrated with other information sources. For these reasons we decided from the start of the project to put the tool on the Internet. Below we describe the key features of the interface which will be developed, and also our thoughts on the Internet and accessibility issues.

Figure 3: System screen shot of GENIA after loading an abstract for processing
Interface
A recurring problem with interfaces has been the assumption that users can become proficient given enough time. A lack of standardization puts the burden on users rather than developers. In designing the GENIA interface we have tried to simplify the functionality as much as possible, to hide tasks which are unrelated to the users' view of the objects in the text (such as linguistic processing for part-of-speech information), and to avoid difficult-to-remember keyboard commands by using the point-and-click environment offered by the Hypertext Markup Language (HTML) [2] and JavaScript. Moreover, the underlying knowledge representation used by the software and the training corpora is SGML, but this is simplified and converted to HTML for the user's view.

A key aspect of the project is providing easy interaction between users (domain experts) and the information extraction programs. Our interface provides a high-level link to the information extraction programs as well as navigation to aid in querying for related information from publicly available databases on the WWW. This is done within the Web browser environment. We now describe how a user can typically interact with the system.

A user of GENIA can enter documents in one of two ways: firstly, by clicking on an "open" button which allows him/her to select a locally stored HTML file; secondly, by entering a search query which is sent to MEDLINE, after which the articles are retrieved and displayed in the main window. Figure 3 shows an image of the inline GENIA system at the point where the user has opened a local document "ex1". Under the file entry box can be seen a group of buttons (programmed in JavaScript) for local file operations, and below these a set of 'radio buttons' for selecting information extraction tasks. By selecting "Named Entity", the result shown in Figure 4 will be seen, in which named entities are identified and classified.

Figure 4: System screen shot of the result of the named entity task

The entities are visualised by highlighting them in different colours depending on their category, for easy recognition. For example, proteins in the text may appear in green and genes in red. We intend to make the entities in the HTML document 'live' by allowing users to click on them to launch them as queries to pre-selected databases (an example of this can be seen in Figure 5). Alternatively the user can click on the article title or author to launch this as a query to the abstract database MEDLINE.

Figure 5: System screen shot of navigation to an external database

By selecting another task, "Scenario template", from the task bar, the user can extract information relevant to an event. In our case the event we have chosen is cell signalling, and the system shows how proteins mentioned within the text are related to one another within the signalling path as defined by a domain model. If such relations are found then they can be represented graphically in a structure diagram (shown in Figure 6).

Figure 6: System screen shot of the application of the result of the scenario template task to building cell signalling diagrams

As mentioned above, although the data used by the software and corpus is encoded in SGML, which we use for corpus markup, it is converted into HTML for the documents which the user views. This is important as it allows us to make every document which the user sees in the browser 'live', i.e. clickable and part of the overall navigation environment. In this way we can automatically embed HTML links to external documents or databases. For example, if we detect that a named entity is a protein, then we can embed a link or a routine allowing that term to be sent as a query to the SwissProt database. Thus navigation feels seamless to the user between retrieved or processed documents and other sources on the Web. This also means that the user has a somewhat filtered view of content and structure compared to the system which processes the text.

The navigation aspect of the interface means that researchers who wish to learn more about a particular gene can move from physical map, to clone, to sequence, to disease, to literature reference and back again within the environment of their browser. The Web is also increasingly being used as a front end to sophisticated analytical software. Sequence similarity search engines, protein structural motif finders, and even motif mapping programs have all been integrated into the Web. Within this framework it is natural to find language processing software implemented via Java or Common Gateway Interface links to a server. The GENIA tool represents an innovative presentation of information content from medical journals and abstracts, with navigable links to online databases, reference sources, and also protein structure databases (i.e. images).
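The coloured highlighting and 'live' linking of recognised entities described above might be generated along the following lines. The colour scheme, the query endpoint URL and the function name are placeholders invented for this sketch, not the system's actual choices.

```python
import html
from urllib.parse import quote

# Placeholder colour scheme and query endpoint -- assumptions for this
# sketch only; the deployed system's databases and colours may differ.
COLOURS = {"protein": "green", "gene": "red"}
SEARCH_URL = "https://example.org/search?db={db}&term={term}"  # hypothetical

def render_entities(text, entities, db="swissprot"):
    """entities: list of (start, end, category) character spans over
    `text`, assumed non-overlapping. Each entity is coloured by its
    category and turned into a hyperlink that launches the term as a
    query to a pre-selected database."""
    out, pos = [], 0
    for start, end, cat in sorted(entities):
        out.append(html.escape(text[pos:start]))   # plain text before span
        term = text[start:end]
        url = SEARCH_URL.format(db=db, term=quote(term))
        out.append('<a href="{}" style="color:{}">{}</a>'.format(
            url, COLOURS.get(cat, "black"), html.escape(term)))
        pos = end
    out.append(html.escape(text[pos:]))            # trailing plain text
    return "".join(out)
```

In the real system this conversion runs from the SGML corpus markup to HTML, so every document the user sees in the browser is clickable and part of the overall navigation environment.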
User Feedback
Our experience so far in the GENIA project has shown that we need to coordinate the application of expertise, not only across organizations but also across disciplines and domains. As developers of an information system we cannot work in isolation from our users (biologist researchers). Open and ongoing dialogue with biologists is crucial for building understanding of complex issues such as technology limitations and user needs. We have found that although developers and users have quite different conceptual frameworks of the domain, only through dialogue can we as developers make users feel involved as participants in the development of enhanced technology. In this process we also find out how best to meet the user's needs.

The first stage of our project was to develop a mockup demonstration (available on the Web at [3]) which could show both potential users and cooperating researchers how we imagined the end system would look. Although it had no functionality, the mockup provided the look and feel of a final system, which users responded to positively, and allowed all of us to consider the system from a similar viewpoint. In particular, users' response to the term visualisation and navigation facilities was very positive and gave us the confidence to continue research in this area. To this end we also consider that a corpus of MEDLINE articles which we are creating for testing is necessary for a number of purposes, including:

- helping in discussions on formalising domain knowledge
- helping to coordinate discussions between cooperating research groups on the data model

Our test collection is made from 150 MEDLINE abstracts tagged by a human expert in biology. Details of the corpus, tagging scheme and ontology which are now being developed can be found in [11].
Current status and the Future
The system at present is still in the early stages of development. At present we can successfully classify term candidate phrases, assuming perfect identification, with performance of 81 per cent as measured by F-score, using a statistical model trained on word lists. This rises to approximately 90 per cent using a decision tree model trained on a tagged corpus. Performance falls, though, to 39 per cent when we incorporate detection of candidate terms using shallow parsing, and to 61 per cent when we detect term candidates with a decision tree method. (Results and analysis are the subject of another paper also submitted to NLPRS.) Work now concentrates on adding functionality to the information extraction buttons through the use of HTML-generated server-side scripts. The scripts use the Common Gateway Interface (CGI) [7] and are written in Perl and C. We seek to ensure minimum changes to the actual HTML pages and to concentrate on the underlying functionality of the GENIA tool set.

References
[1] T. Berners-Lee and R. Cailliau. World Wide Web. In Computing in High Energy Physics, 1992.

[2] T. Berners-Lee and D. Connolly. Hypertext markup language: A representation of textual information and metainformation for retrieval and interchange. Internet Working Draft, CERN, Atrium Technology Inc. Work in progress: http://info.cern.ch/hypertext/WWW/MarkUp/HTML.html, 1993.

[3] GENIA. Information on the GENIA project can be found at: http://www.is.s.u-tokyo.ac.jp/~nigel/GENIA.html, 1999.

[4] W. Hersh, C. Buckley, T. Leone, and D. Hickam. OHSUMED: An interactive retrieval evaluation and new large test collection for research. In Proceedings of the 17th Annual International Conference on Research and Development in Information Retrieval (SIGIR'94), pages 192–200, Dublin, 3–6 July 1994.
[5] Y. Jing and W. Croft. An association thesaurus for information retrieval. In Proceedings of RIAO'94, pages 146–160, 1994.

[6] G. Jones, N. Collier, T. Sakai, K. Sumita, and H. Hirakawa. A framework for Cross-Language Information Access: application to English and Japanese. Computers and the Humanities, 1999 (to appear).

[7] R. McCool. Common gateway interface overview. Work in progress: http://www.ncsa.uiuc.edu/overview.html, 1993.

[8] MEDLINE. The PubMed database can be found at: http://www.ncbi.nlm.nih.gov/PubMed/.

[9] Norihiro Ogata. Dynamic constructive thesaurus. In Language Study and Thesaurus: Proceedings of the National Language Research Institute Fifth International Symposium: Session 1, pages 182–189. The National Language Research Institute, Tokyo, 1997.

[10] Norihiro Ogata. A type-theoretic dynamic construction of a taxonomy and mereology from texts of specific domains. pages 133–140, 1998.

[11] Y. Ohta, Y. Tateishi, N. Collier, C. Nobata, and J. Tsujii. Building an annotated corpus from biological papers. In 59th Annual National Convention of the IPSJ Zenkokutaikai (in Japanese), Iwate Prefectural University, 28–30 September 1999 (to appear).

[12] M. E. Salomon and D. C. Martin. An electronic journal browser implemented in the World Wide Web. In 2nd World Wide Web Conference '94: Mosaic and the Web, 1994.

[13] T. Sekimizu, H. Park, and J. Tsujii. Identifying the interaction between genes and gene products based on frequently seen verbs in MEDLINE abstracts. In Genome Informatics. Universal Academy Press, Inc., 1998.
[14] A. Voutilainen. Designing a (finite-state) parsing grammar. In E. Roche and Y. Schabes, editors, Finite-State Language Processing. A Bradford Book, The MIT Press, 1996.