From Text to Knowledge: a Unifying Document-Centered View of Analyzed Medical Language

Pierre Zweigenbaum, Jacques Bouaud, Bruno Bachimont, Jean Charlet, Brigitte Séroussi, Jean-François Boisvieux

Service d'informatique médicale, Assistance Publique - Hôpitaux de Paris & Département de biomathématiques, Université Paris 6

Abstract

Although medical language processing (MLP) has achieved some success, the actual use and dissemination of data extracted from free text by MLP systems is still very limited. We claim that the adoption of an "enriched-document" paradigm (or "document-centered" view) can help to address this issue. We present this paradigm and explain how it can be implemented, then discuss its expected benefits both for end-users and MLP researchers.

1 Introduction

Some medical language processing (MLP) systems have achieved a reasonable level of performance and success from a technical point of view, the most well-known one being unarguably the Linguistic String Project Medical Language Processor (LSP/MLP, [1]). However, the actual use and dissemination of data extracted from free text by MLP systems is still very limited. In our opinion, the most crucial factor which impedes a wider use of these results by health care professionals involves the expected quality of the resulting data: one obviously cannot rely on extracted data as a substitute for the original text as long as this data is not produced by a near 100% safe procedure. Free text processing has not reached such a level of performance; neither can one expect humans to. On another level, MLP results are also not disseminated enough to the research community. Some conditions for this to happen are indeed political (individual and organisational willingness) and infrastructural (availability of networks and servers). But there is also an issue of representation framework for such interchange to take place. At present, the kinds of results that are produced by different MLP systems are often so distant from one another that their levels of representation of medical information are simply not congruent.

We present in this paper a paradigm which can help to address these two problems. In this paradigm, we consider the analyses of a text as enrichments to this text: the resulting document integrates the original text, "annotated" by the data produced by MLP systems. Extracted data no longer replace the original text, which can be checked by the information user, alleviating the need for 100% reliable data. Besides, such an approach can facilitate the identification of data exchanged between MLP researchers. This view of text management is not new. It is widespread in the natural language processing (NLP) and document processing communities, and is becoming increasingly known in medical informatics too. Our purpose is to transpose it to the context of medical language processing. We first recall some background material, and then present the principles of the proposed model. We illustrate the model on several examples, and show how the resulting enriched documents can be exploited. We discuss the advantages that such a "document-centered" view of analyzed medical language should bring both to final users (health care professionals) and to MLP researchers. We also identify some limitations, and propose further directions of research.

2 Background

We present here the background which supports the paradigm proposed in this paper: a schematic reconstruction of MLP tasks, the emergence of a trend towards a document-centered patient record, the usefulness of annotated corpora, and some basics about document markup (see figure 1).

To appear in Methods of Information in Medicine. A first version was presented at the "Fourth International Conference on Medical Concept Representation", workshop of Working Group 6 of IMIA, Jacksonville, Fl., January 1997.


Figure 1: Background: convergence. (Diagram: MLP results, a document-centered EMR that segments and categorizes text contents, annotated corpora for NLP development as knowledge about language, variably structured text and data, and SGML markup all converge on the enriched document.)

2.1 A reconstruction of medical language processing tasks

MLP systems aim at extracting "medical data" from free text. The prototypical application is diagnosis encoding (e.g., [2]), where the desired output is a diagnosis code in an existing classification (e.g., the International Classification of Diseases [3]). This is a classification, or categorization, task: given a string of words and an a priori set of classes, determine the class which should be assigned to this string. An extended version of categorization can assign several classes (or indices, or keywords) to an input string, for instance, to associate a set of SNOMED [4] codes with an input sentence [5]. Some "ancillary" procedures in MLP are also categorization tasks: e.g., determining the part-of-speech of each word, or (more interestingly) determining the broad semantic category of each word or expression. The latter task plays a central role in [1], and [6] shows that its status can be promoted from ancillary MLP to an actual, end-user service.

Although categorization may be the only task expected from an MLP system (e.g., in [2]), another important task is segmentation: given a string of words, identify relevant segments in this string. These segments correspond to meaningful units, for instance the expression of a diagnosis, which one might further want to categorize. Segmentation, in this context, makes it possible to focus on (and categorize) specific parts of a text, rather than considering it as an indistinct whole. An application of segmentation could be to identify in a text all instances of diagnoses. This task is different from the above categorization of a string which is already known to be a diagnosis. "Ancillary" segmentation tasks include morpho-syntactic segmentation (into words, sentences, phrases and other syntactic constituents) as well as the identification of semantic units such as diagnoses, acts, or anatomic localizations.

Segmentation may be structured: segments may be nested in one another. For instance, a diagnostic expression may mention a specific anatomical location: we then have an anatomic location segment inside a diagnosis segment. The basic model of mainstream syntax is that of constituent structure, which is indeed a structured segmentation. Sager's MLP system [1] can essentially be described as a structured segmentation and categorization system. Categorization and segmentation correspond to the paradigmatic and syntagmatic division in linguistics. It is therefore not a surprise that the data extracted from a natural language text can often be represented as a categorization of text segments. This accounts for a large part of MLP systems (e.g., [1, 7]), and does address a substantial part of the needs for data extraction from medical text.

However, instead of assigning a text segment an atomic category drawn from an a priori set, one may choose to build a complex category for it. The hypothesis is that a finite set of a priori (pre-coordinated) categories is not sufficient for dealing with the open-endedness of natural language, and that one therefore needs to rely on a generative language for describing a potentially infinite set of (post-coordinated) categories. This is the assumption made by those who rely on some sort of compositional language [4, 8] for representing medical information, and in particular by the proponents of knowledge representation languages [9, 10]. Such a complex categorization is generally not reducible to atomic categorizations of structured segments, and must therefore be considered as a distinct class of MLP production.

The paradigm we present in this paper naturally caters for the first two types of NLP output: atomic categorization and (structured) segmentation, and thus can help to manage a very large part of the results of MLP systems. It can also be made, with some restrictions, to host the latter type (complex categorization).

2.2 The document-centered patient record

The traditional view of the electronic medical record (EMR) is that of a database which holds items of coded, or standardized, information. While this view has achieved some success, it does require a substantial effort of standardization over medical information. Moreover, a complete formalization of medical information is not theoretically reachable; this is precisely one of the reasons why natural language still has some appeal to medical practitioners.

An alternate view gives textual documents a first-class status. It is inspired both by an observation of non-computerized medical practice (e.g., [11, 12]) and by the fast-growing document processing industry. In that view, the EMR is a collection of documents, which can be composed of text, images, etc. Textual documents here can, and generally do, have structure. They can be hierarchically divided into identified chunks, down to any suitable level, thus providing much the same data structuring advantages as databases. But at the same time, not every piece of text has to be standardized: the balance between formal data and natural language text can be more finely tuned than with traditional database models, and the two can merge together more tightly. The expected benefit is first a better account of available medical information, described as it can be expressed, rather than as it should be expressed to fit a predefined computer model. Assuming that a document-centered view of the medical record is closer to non-computerized practice than an item-oriented view, an additional benefit is a potentially better-suited interface for health care professionals to work on the patient record. Several attempts at experimenting with such a document-centered patient record are ongoing [13, 14, 15, 16, 17, 18].

The document-centered view aims at finding a suitable tradeoff between free text and formal data. By producing data from free text, MLP therefore plays a central role in a document-centered patient record and should benefit from it.

2.3 Annotated corpora as a precious resource

The 1980s saw the development of increasingly large on-line text corpora, and their size has now exploded with the advent of the World Wide Web. While large volumes of text are a very useful resource for studying language, for evaluating NLP tools and for training them [19], annotated corpora are an even more valuable asset for these same purposes. Annotated corpora link texts with their analyses: a typical, basic level of analysis may involve the identification of sentence boundaries or the part-of-speech of each word. More elaborate analyses associate syntactic parses with sentences or semantic categories with words. The resulting corpora are much more than text: they constitute sources of knowledge about language. Typical work on annotated corpora includes:

- evaluating NLP systems: this is how the DARPA MUC competitions have been proceeding [20];

- training NLP systems: e.g., a part-of-speech tagger can tag accurately some 97% of the words of a corpus provided it is given a large enough hand-annotated training corpus [21];

- building resources for NLP systems; this is actually a variant of the preceding point: a corpus where each word is marked for semantic category can be used to learn selectional restrictions or taxonomies [22].

The same idea lies behind some statistical approaches to MLP, which learn word-to-code associations on the basis of a manually provided training sample. The paradigm we present is based on the same idea of annotating a text with its analyses: we believe this can help MLP researchers both for their own developments and by facilitating the exchange of enriched corpora.

2.4 Marking up text

Managing structure, categories, and annotations in a text can be realized by inserting into it tags which mark this structure, specify the categories, and identify the annotations. The publishing industry has pushed forth an initiative which led to the adoption of an ISO standard for this purpose: the Standard Generalized Markup Language (SGML, [23, 24]). SGML has since spread to a wide variety of industries and research disciplines. HTML (HyperText Markup Language), the language of the World Wide Web, is itself a fixed application of SGML. SGML basically allows one to define the nature and structure of the "tags" which can be inserted in a document. Tags delimit "elements" of a document, i.e., categorized segments, such as a section title, paragraph, diagnosis or person name. Tags may also associate "attributes" with elements, such as the level or number of a section title, the sex of a person or the standardized code for a diagnosis. The set of allowed tags in a class of documents and their authorized order of appearance and nesting are specified in a "Document Type Definition" (DTD), itself written in SGML. Large SGML DTDs have been proposed for a wide variety of industries and scientific disciplines. Of prominent interest for NLP is the Text Encoding Initiative [25], an international project which has led to a DTD encompassing a large part of the needs of the humanities and of literary and linguistic research. In the medical informatics arena, the HL7-SGML group has recently been promoting the use of SGML for the medical record [26].
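The element and attribute mechanics described above can be sketched with off-the-shelf tools. The fragment below, a minimal sketch, uses the XML-conformant subset of SGML (full SGML allows tag omission and unquoted attribute values, which stock XML parsers reject); the element names follow the paper's examples, and the attribute values are illustrative assumptions, not taken from an actual DTD.

```python
import xml.etree.ElementTree as ET

# A hypothetical tagged fragment in the spirit of the paper's figures
# (accents omitted for simplicity).
doc = """<div type="reason for admission">
  <head>MOTIF D'HOSPITALISATION</head>
  <p>Monsieur <name>DURAND</name> est hospitalise le
  <date>18/1/1997</date> pour un
  <diagnosis ICD9CM="413.9">angor spontane</diagnosis>.</p>
</div>"""

root = ET.fromstring(doc)

# Tags delimit "elements", i.e., categorized text segments ...
diagnosis = root.find(".//diagnosis")
print(diagnosis.tag, "->", diagnosis.text)
# ... and "attributes" attach standardized codes or subcategories.
print(diagnosis.get("ICD9CM"))
print(root.get("type"))
```

Note that the original text survives intact as element content: stripping the tags recovers the source sentence, which is the property the enriched-document paradigm relies on.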

3 The enriched document model

In this section, we present the principles of the enriched document paradigm and sketch a possible implementation based on the SGML language.

3.1 MLP as a document enrichment process

We propose to handle the analyses which MLP can produce from a text as enrichments to this text. Once "data" has been obtained, the original text is not "forgotten". It remains there in the forefront to help to manage the derived data. It is still the reference that the reader of the patient record may want to consult when examining this patient's data.

3.2 Indexing interpretations on the source text



The general picture of an enriched document is that of a text, some portions of which are associated with "analyses", or "interpretations" (in the above-proposed terms, "categorizations"). A very simple principle is applied throughout the model: index these interpretations on the source text that they interpret (see figure 2). This gives a deserved importance to the segmentation aspect of NLP. The same principle was at the root of seed models in computational linguistics (the chart model for parsing [27]) and artificial intelligence (the blackboard model for problem-solving [28] and the ATMS model for belief maintenance [29]).

Figure 2: Indexing interpretations on the source text. (Diagram labels: original text is still available / add as many interpretations as useful (categorization) / can interpret selected parts of document (structured segmentation) / two-way link: search interpretations or text.)

Let us review some of the properties this indexing brings to the resulting enriched document.

- The default entry point into the document remains the original text. None of its structure or contents is lost in the new document, as might be the case in an item-oriented paradigm. Therefore, the basic structure and basic reading of the document are those of the source text. The importance of the structure and physical aspect of documents in the patient record has been stressed by some authors (in particular, [11]): keeping these features should be helpful to the reader.

- Several interpretations of the text can be added monotonically to the same document. For instance, one MLP process could identify and categorize the overall structure of the text (reason for admission, antecedents, etc.) whereas another one could determine the SNOMED encodings of the relevant expressions in the text. This possibility of task division has implications both for MLP architectural issues and for research team collaboration.

- An analysis of only some parts of a text can be handled. Moreover, the actual piece of text which gives rise to a specific analysis can be precisely identified. By applying categorization to a segmented text, one can distinguish between an analysis of a whole text and an analysis of a limited extract of this text. For instance, a diagnosis will not have the same meaning if it is presented as the diagnosis of a patient (attached to the whole report) or as a diagnosis found in the "reason for admission" section or in the "antecedents" section.

- MLP-provided data can be useful as normalized search data. By indexing analyses on their source text, one can thus link back to the source text from the search data. As a consequence, since text and corresponding data are synchronized, one can have search processes operate on normalized data while letting the user view understandable text. This back link is also an essential feature to enable traceability and quality control of MLP results.

- Since several such interpretations can be synchronized, search criteria operating on several levels of interpretation can be combined, resulting in greater search power. For instance, one can search for a given diagnosis (segment indexed with a "diagnosis" category) in a specific section (segment indexed with a "section" category) of a discharge summary, and search or view the corresponding original text.

Finally, if an analysis produces a categorization, this categorization is inherited by the text segment indexed with it. Such a categorization can be made perspicuous to the reader through appropriate markup and rendering devices [6].

3.3 A simple implementation of the enriched text model

The main features of the enriched text model can be implemented with a few SGML constructs.

3.3.1 Segmentation and categorization

A basic kind of analysis simply consists in categorizing text segments. Text segments can range from a (part of a) word to a whole paragraph, section or document. Categories may be morphosyntactic (part-of-speech, lemma, syntactic category) as well as semantic (broad semantic category, word sense, position in a thesaurus) or conceptual (position in an ontology). Another kind of segmentation concerns the general structure of the text (header, signature, divisions, paragraphs). An analysis which results in the categorization of a text segment can be recorded as an encapsulation of this text segment within start and end tags assigned to this category. For instance, the text of figure 3 is tagged with SGML elements denoting segmentation and categorization of structural units (div, head, p) and semantic units (name, date, diagnosis). A variant uses attribute values to specify categories. In the same example, a semantic categorization of divisions is encoded as a type attribute, and a precise semantic category for diagnoses is specified as an ICD9CM attribute whose value is a position (a hierarchical code) in the ICD-9-CM classification. Figure 4 shows an additional example where the beginning of a patient discharge summary is tagged with divisions (div, head), paragraphs (p), person names (name), health care locations (locsoins, typelocs, nomlocs), dates (date), symptoms and diagnoses (etatpath), medical treatments (theramed, drug, poso), etc.

The rules that govern which elements and attributes can be used and how they can nest are specified in an SGML document type definition (DTD). Figure 5 shows an extract of the DTD enforced in the document of figure 4. Each legal element is declared (!element declaration) with its allowed attributes (!attlist declaration). The content of an element can be simple text (#PCDATA) or other SGML elements.
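Encapsulating a recognized segment within start and end tags can be sketched as a plain text transformation. Below, a toy date recognizer (a regular expression of our own, not an actual MLP component) wraps its matches; a second, independent pass could later add diagnosis tags without disturbing this one, illustrating the monotonic enrichment discussed earlier.

```python
import re

# Toy "segmentation + categorization" pass: find date expressions and
# wrap them in <date> tags. The regex and the tag name are illustrative.
def tag_dates(text: str) -> str:
    return re.sub(r"\b(\d{1,2}/\d{1,2}/\d{2,4})\b",
                  r"<date>\1</date>", text)

source = "Monsieur DURAND est hospitalise le 18/1/1997 pour un angor."
enriched = tag_dates(source)
print(enriched)

# The original text is fully recoverable by stripping the markup:
assert re.sub(r"</?date>", "", enriched) == source
```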

Figure 3: Categorization and segmentation. (Tagged text, markup not preserved in this extraction: "MOTIF D'HOSPITALISATION / Monsieur DURAND, âgé de 88 ans, est hospitalisé le 18/1/1997 pour un angor spontané à répétition.")

Figure 4: A more complete example of tagged text. (Tagged text, markup not preserved in this extraction: "MOTIF D'ADMISSION / Monsieur DURAND, âgé de 61 ans, est admis dans l'Unité de Réanimation du Service de Pneumologie du Groupe Hospitalier Pitié-Salpêtrière le 26/7/90 pour prise en charge d'une ischémie aiguë du membre inférieur gauche avec insuffisance rénale oligo-anurique. / PROVENANCE : Urgences Pitié.")

3.3.2 More complex analyses

More complex analyses need to be represented explicitly. The TEI has defined several notations for managing such analyses (see, e.g., [30]). The simplest one consists in inserting the analysis as text enclosed within specific start and end tags, at the place in the text where it belongs (interlinear analysis). Assuming for instance that we want to include the conceptual representation of each sentence as derived by an MLP system such as MENELAS [10] or RECIT [31], we can segment the text into sentences (element s), each of which includes this conceptual representation (in a cr element; see figure 6). The corresponding DTD must indeed cater for these elements. In variant encodings, such analyses are grouped together in a special part of the document (for instance, at the end) and SGML reference links tie together a text segment and its analysis. This has the advantage of keeping the "main" part of the document "clean". It is even possible to keep the analyses outside the source document, so that it remains untouched (and therefore may have a read-only status, a safe way to implement data integrity), using external references to positions in this document. HyTime links [32], in particular, make this possible.

A limitation of the proposed encoding for "complex" analyses is that they are stored as plain text rather than encoded as markup (elements and attributes). As a consequence, the processing mechanisms built into SGML tools do not help much for processing (e.g., searching) these representations.

4 Uses of enriched documents

4.1 Overview of uses

We envision two broad classes of users for enriched medical documents: health care professionals and medical informatics researchers (see figure 7). We define the first class as being interested primarily in working on patient data. The second class is concerned with methods and tools for processing medical language and knowledge.

Figure 7: The enriched document model. (Diagram: the enriched document at the intersection of the medical record and medical language, with clinical use on one side and MLP research on the other.)

An enriched document may be used in (at least) three ways. One may simply want to consult the document: the enrichments can help reading or navigating through the document. One can obtain information or data from the document, for instance, extract patient information to be sent to another person or system. Finally, one may want to store data into the document, including writing the original text. The last two operations may be combined, for instance when applying NLP tools to a text and storing the results as appropriate enrichments. We first examine several uses of enriched documents in a health care setting. We then recall the main benefits we see for the medical informatics research community.

Figure 5: A document type definition (extract). (Markup not preserved in this extraction; the extract declares elements such as measure, with content model (grandeur, value), and grandeur, value (with content model (num, unit?)), num and unit, each with its !element declaration, allowed attributes (!attlist) and #PCDATA content where applicable.)

4.2 Consulting enriched documents

The segmentation and categorization in an enriched document can be used to help a reader consult this document:

- a basic idea is to use the segmentation and categorization to compose the layout of the document, for instance, to highlight segments of a specific category [6];

- one can further present a different view of the same document, fit to a specific goal or task, for instance, creating a table of contents by extracting the section titles, or creating an index or a synopsis by extracting the pathological states or observations [15];

- one can also link together segments in the same or in different documents, thus providing hyperlinks: for instance, linking together all mentions of problems affecting a given anatomical location, or linking the mention of an examination in a discharge summary to the actual, full examination report.

The first two operations on an enriched document are extremely simple to implement with SGML-aware software. For instance, using an SGML browser (e.g., Panorama PRO, [33]), one can define a style sheet which specifies how each kind of SGML element should be rendered. Figure 8 shows a Panorama display of the document of figure 4, where section heads are in boldface, symptoms and diagnoses are in italics, examination results are indented, and person and location names are replaced with an icon. With this same SGML tool, one can define "navigators", user-tailored tables of contents which are presented in a companion window and synchronized with the main text window. Figure 9 shows a navigator based on section heads and symptoms and diagnoses. This navigator can be used both as a synopsis to get a quick feel for the patient's history and, indeed, as a navigation device: a click on one of the items in this window scrolls the main window to the actual occurrence of this item in the source text. With the appropriate tools (here, the Panorama PRO SGML browser), these aids come at a marginal cost once the text is tagged. The third feature (hyperlinks) generally requires that the desired links exist and are encoded as appropriate tags in the document. These links could be precomputed, e.g., by NLP software, or be user-defined.
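Building such a navigator amounts to collecting, in document order, the segments of the categories of interest. A minimal sketch, assuming the hypothetical XML-conformant markup used earlier (the element names head and diagnosis follow the paper's examples):

```python
import xml.etree.ElementTree as ET

# Collect section heads and diagnoses, in document order, to serve
# as a synopsis / table of contents for navigation.
doc = """<report>
  <div type="reason for admission"><head>MOTIF D'HOSPITALISATION</head>
    <p>... <diagnosis>angor spontane</diagnosis> ...</p></div>
  <div type="antecedents"><head>ANTECEDENTS</head>
    <p>... <diagnosis>infarctus</diagnosis> ...</p></div>
</report>"""

root = ET.fromstring(doc)

# iter() walks the tree depth-first, i.e., in document order.
synopsis = [(el.tag, el.text) for el in root.iter()
            if el.tag in ("head", "diagnosis")]
for tag, text in synopsis:
    print(tag, ":", text)
```

In an SGML browser the same selection is expressed declaratively (a navigator definition), and clicking an entry scrolls to the source occurrence; the two-way link comes for free because each entry is indexed on the text segment it was extracted from.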

4.3 Searching a collection of enriched documents

The basic search mechanism in an item-oriented database deals with formal data. Searching natural language texts requires different methods, namely content-oriented search, also called full-text search. Recent software makes it possible to apply full-text search to structured text, taking into account its segmentation and categorization features. So, for instance, one can search for occurrences of the word "kidney" in diagnosis elements that are nested in div elements with attribute value type='reason for admission'. One can at will choose to search on text, on tags (and attributes), or on a combination of both: encoding MLP analyses as SGML-tagged segments in enriched text enables the application of off-the-shelf software to perform powerful searches on both the original text and the data extracted from it.

As hinted above, such tools know how to manipulate SGML markup and plain text. However, confronted with, for instance, the conceptual graph of figure 6, the best they can do is to consider it as text. As a result, the full formal meaning of the conceptual graph is lost. Nevertheless, concept and relation names can be searched as if they were keywords, which can still be useful (as a fallback) depending on the kind of queries which are processed. A better account of such elaborate representations requires some investment beyond a simple use of off-the-shelf SGML software.

Figure 6: Embedding a conceptual representation. (Tagged text, markup not preserved in this extraction: "MOTIF D'HOSPITALISATION / Monsieur DURAND, âgé de 88 ans, est hospitalisé le 18/1/1997 pour un angor spontané à répétition." followed by its conceptual graph: [Admission](past)- (pat)->[Human_Being](defines_cultural_function)->[Medical_Subfunction]--(cultural_role)->[Patient:I63] (attr)->[Age]--(val_quant)->[Quantitative_Val:88]--(reference_unit)->[Year_Duration] (motivated_by)->[Angina_Syndrome:I77] ...)

4.4 Extracting information

An enriched document is an elaborate compound from which one may wish to extract only a selected subset. This may be useful, for instance, in a workflow context to produce a targeted report to be sent to a correspondent. This is also convenient to build different views of the same document for consultation, as suggested above. The basic mechanism for these purposes is the selection, in tagged texts, of specified segments, based on markup and content, and their rearrangement in the desired order to produce the target document. This mechanism can be effected by SGML transformations, for which a variety of commercial or public-domain software exists. One may cite in particular the SGML query language SgmlQL developed in the Multext project [34], which allows such transformations to be specified in a declarative way. One should note that SGML transformations are a more general and powerful mechanism than simple style sheets. DSSSL (Document Style Semantics and Specification Language, [35]) aims at standardizing this notion. Performing SGML transformations will be easier and more portable when implementations of DSSSL become widely available.

Figure 8: Rendering an enriched document. (Windows 95 screen dump)
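The combined markup-and-content search of section 4.3 can be sketched with stdlib XPath support. The query below is the example given in the text: occurrences of "kidney" in diagnosis elements nested in a div with a given type attribute (document and element names are the same hypothetical XML-conformant markup as before):

```python
import xml.etree.ElementTree as ET

doc = """<report>
  <div type="reason for admission">
    <p><diagnosis>acute kidney failure</diagnosis></p></div>
  <div type="antecedents">
    <p><diagnosis>kidney stones</diagnosis></p></div>
</report>"""

root = ET.fromstring(doc)

# Search on a combination of tags, attributes and text content:
# diagnosis elements nested in a div of the requested type.
def find_diagnoses(root, word, div_type):
    hits = []
    for div in root.findall(".//div[@type='%s']" % div_type):
        for diag in div.iter("diagnosis"):
            if word in (diag.text or ""):
                hits.append(diag.text)
    return hits

print(find_diagnoses(root, "kidney", "reason for admission"))
```

The same selection mechanism, followed by rearrangement of the selected segments, is what an SGML transformation language such as SgmlQL expresses declaratively.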

4.5 Enriching a document

Enriched documents accommodate expansion in at least two contexts. First, health care professionals must be able to enter new information into the patient record (with the usual precautions to comply with authorizations and ensure data integrity). Beyond the mere creation of new documents, a reader can annotate existing documents [36]: the document-centered paradigm facilitates "active reading". Second, MLP programs can be applied to the current state of a document and perform interpretation tasks as proposed above. The basic flow of control consists in selecting input segments, passing them to the MLP program, and then inserting the results back into the enriched document structure. A typical processing chain could be: select all occurrences of diagnoses (diagnosis elements, identified in a previous pass); pass them to an encoding program, which produces an ICD code for each of them; store the resulting ICD codes into an ICD9CM attribute of the diagnosis element.

The Multext project [34] has defined a general NLP architecture where components work on an SGML encapsulation of natural language data. Note that MLP programs need not all be SGML-aware: Multext also proposes alternate encodings and transcoders so that usual NLP components can be plugged into such a processing chain. Such an architecture facilitates the reuse of existing MLP components: its strength relies on the definition of a set of analysis levels (input or output of NLP components) and the corresponding encodings. Some general-purpose levels have been identified and some encodings proposed by NLP projects such as Multext or the TEI [25]. An obvious need for this kind of architecture to be really useful for MLP is to define relevant levels of analysis for medical texts. We return to this point later.
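The typical processing chain described above can be sketched in a few lines: select the diagnosis elements, pass their text to an encoding program, and store each resulting code back as an ICD9CM attribute. Here the encoder is a stand-in lookup table, not a real coding program, and the code value is illustrative:

```python
import xml.etree.ElementTree as ET

# Stand-in "encoding program": maps a diagnosis expression to an
# ICD code. A real coder would be an MLP component; this table and
# its code value are illustrative only.
def encode(text):
    table = {"angor spontane": "413.9"}
    return table.get(text, "UNKNOWN")

root = ET.fromstring("<p>... <diagnosis>angor spontane</diagnosis> ...</p>")

# Select input segments, run the program, insert results back
# as markup: the enrichment is monotonic, the text is untouched.
for diag in root.iter("diagnosis"):
    diag.set("ICD9CM", encode(diag.text))

print(ET.tostring(root, encoding="unicode"))
```

A component that is not markup-aware can still be plugged into such a chain through transcoders, as Multext proposes: the selection step strips the markup before calling it, and the insertion step puts the results back.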

4.6 Exchanging enriched documents

The general NLP community has been exchanging data and programs for a while now: one can nd repositories for lexicons, annotated corpora, parsers and other NLP resources in multiple places over the world (see for instance the Natural Language Software Registry [37] or the European Language Resources Association [38]) | although one must stress that resources are most developed for the English language. 8

(Windows 95 screen dump)

Figure 9: A synopsis and navigation help. The MLP community, in contrast, lacks a comparable level of sharing of speci c resources. This point was put forward at the last meeting of Working Group 8 (Natural Language Processing) of the European Federation for Medical Informatics, which was held during the Medical Informatics Europe Conference in August 1996. Enriched documents constitute annotated corpora which can be extremely useful to the MLP research community. They allow to compare more precisely di erent approaches, thanks to common test data, and would be a central piece of material in evaluations such as discussed by [39]. For instance, to work on automated diagnosis coding, a necessary material is a corpus of already coded diagnoses. A corpus of texts (e.g., patient discharge summaries) where diagnoses are identi ed and coded (i.e., segmented and categorized) with conventional SGML elements and attributes would be a valuable resource to share for this purpose. Annotated corpora can also be used to train coding programs. For instance, a large enough corpus where each word is tagged with a broad semantic category (e.g., Sager's 39 semantic categories, or the categories derived from SNOMED's 11 axes) could provide training material for a semantic tagger. The normalized document or corpus \header" provides room to describe the contents of and speci c conventions applied in the annotated corpus. We believe that the speci cation of a general set of MLP-produced data would help the MLP community focus on useful data to be exchanged. Furthermore, de ning encodings for these data, in the form of SGML Document Type De nitions, would provide a more precise framework and would facilitate data exchange. Here again, the work of the TEI [25] is of interest: one of the main goals of the TEI was precisely to facilitate the exchange of annotated textual data between scholars. 
A valuable feature of the TEI scheme is the provision, for each text or corpus, of a header [40] where information about the document can be inserted: author, mode of construction and encoding, etc. This self-documentation is particularly useful when exchanging corpora between different teams.
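A minimal sketch of such self-documentation is shown below. The element names (teiHeader, fileDesc, titleStmt, encodingDesc) follow the TEI header's vocabulary, but the fragment is heavily abbreviated and is not a valid TEI P3 header; the field contents are invented, and Python's XML parser again stands in for SGML tooling.

```python
import xml.etree.ElementTree as ET

# Abbreviated, TEI-flavoured self-documentation for an exchanged corpus.
header = """
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>Discharge summaries, cardiology, 1996</title>
      <author>Service d'informatique medicale, AP-HP</author>
    </titleStmt>
  </fileDesc>
  <encodingDesc>
    <p>Diagnoses segmented and coded in ICD-9; one record per file.</p>
  </encodingDesc>
</teiHeader>
"""

def describe(header_text):
    """Pull the basic provenance fields a receiving team would check."""
    root = ET.fromstring(header_text)
    return {
        "title": root.findtext(".//title"),
        "author": root.findtext(".//author"),
        "encoding": root.findtext(".//encodingDesc/p"),
    }

print(describe(header))
```

A receiving team can thus check provenance and encoding conventions mechanically before loading the corpus.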

5 Discussion

The enriched document paradigm presented here relies on an SGML encoding of medical text and data. As already mentioned, SGML is making its way into the medical informatics community. We would like to stress two important specificities of our proposal. First, we are concerned with partially structured information: a mixture of text and data, where natural language itself can undergo some variable degree of structuring. This feature of traditional (paper-based) medical records is considered essential by authors such as [11]. This is different from approaches which assume a strict division between structured data and free text, as we understand HL7 and its SGML initiative to do. It is also different from approaches where structured data is accessed and displayed through HTML front-ends [41]. Second, we use SGML to tag the logical content of documents, rather than their external presentation. The actual presentation and layout of documents must be determined by separate style sheets, which can be important too, but must be designed at a different level. This contrasts with methods which directly encode the presentation of documents with specific SGML markup, for instance HTML.

Whereas MLP produces data from natural language text, it does not have to replace text with this data. A more conservative approach simply adds this data to the original text. As a consequence, depending on the use which is made of the resulting document, 100% reliability is not necessary. One can draw a parallel with information retrieval: 100% recall and precision are not necessary (nor are they available) to provide useful services to users. Answers to queries come in the form of actual texts or abstracts, which the user can read to make his or her final assessment. Since the user is "in the loop", the program only serves as an assistant; the final thinking and decision belong to the user. The enriched document approach makes it possible to follow a similar principle: although search and navigation can work on MLP-produced data, the user can always be presented with the synchronized, original text or text extract, together with the data if relevant.

Note that the consultation of an MLP-enriched document offers something more than a "classical" (hyper)text. Navigation can rely not only on explicit text and links, but also on the attached, underlying MLP-produced interpretations. For instance, given the appropriate tools, one can navigate through the links induced by common semantic categories (e.g., go from one diagnosis to the next in a text) or search for text segments whose attached analysis at a given level satisfies some criterion (e.g., find all occurrences of conditions affecting the lower limbs, based on a SNOMED encoding of the text). This opens up a whole range of dynamic navigation mechanisms.

MLP deals with tasks of various complexities. For instance, identifying segments of a given broad semantic category (e.g., all occurrences of anatomical locations) is less difficult than providing a detailed conceptual representation for each sentence. Indeed, the results of MLP systems generally degrade as the difficulty of the task increases. The proposed enriched document architecture can allocate room for these different levels of MLP results. The exploitation of the available data is then more or less efficient according to the available levels and the quality of the MLP components. For instance, consultation aids (e.g., layout and rendering, see above) can be of varying quality depending on which segmentations and categorizations are available and how accurate they are. Here again, the approach is well suited to services that can accommodate some degree of imprecision or incompleteness in their input data, and that give a major role to text. It makes it possible to put the simpler MLP tools to work in the short term, and to introduce more sophisticated tools gradually, instead of having to wait until MLP is perfect before it can be used.

The underlying assumption throughout this paper is that it is possible to specify a common set of general data types and their encodings for MLP-processed data: in other words, SGML document type definitions (DTDs) for enriched documents. This work still remains to be done, and is by no means easy. It is strongly linked to the definition of DTDs for medical language documents, which is itself closely related to the definition of the electronic medical record. Several tracks already go some way in this direction. In the NLP community, enterprises such as the Text Encoding Initiative or the Multext project have made steps towards DTDs for various kinds of texts and analyses. The TEI DTD, in particular, is a very wide-coverage, extremely parameterizable document type definition which can thus be adapted to many needs. In the medical informatics community, efforts such as CEN TC251 or HL7 have considered the encoding of many kinds of medical data; both are considering the use of SGML to pursue their tasks.

The proposed architecture does not solve the issue of the incompatibility (or incomparability) of different representations of medical data. It can help to establish, though, that representations concern the same level of interpretation (e.g., different controlled vocabularies for coding diagnoses, or different knowledge representations for describing the meaning of a sentence), and it provides notations to make explicit the nature and extent of each representation; but it does not include means to go from one representation to another. Projects such as the UMLS [42] or GALEN [9] do address this issue. The enriched document paradigm can help, for its part, to anchor these different representations to natural language texts and to one another.

6 Conclusion

We have described a general paradigm for managing data produced by medical language processing systems and, more generally, medical data, from its textual form to a knowledge representation. The basic principle is to associate NLP-produced data with its source text, the natural implementation being based on SGML markup and tools. Although this does not constitute in itself a new NLP technique, we have shown that this principle can bring many benefits to MLP, both by fostering shorter-term application of current MLP components in a health care setting, and by facilitating data interchange between MLP research teams. The major need for this paradigm to be effective is the definition of an architecture and document type definitions for MLP-produced data. The definition of DTDs for medical documents is an independently motivated goal for the document-centered medical record [12, 18] and for a data-oriented electronic medical record [26]. We believe that these requirements should be joined within a common "medical document encoding initiative".

7 Acknowledgments

We thank Vincent Brunie for useful discussions about SGML and SGML tools, and Dr. Thomas Similowski for providing patient record material for experimentation.

References

[1] Sager N, Lyman M, Bucknall C, Nhan NT, and Tick LJ. Natural language processing and the representation of clinical data. Journal of the American Medical Informatics Association 1994;1(2):142-60.
[2] Gundersen ML, Haug PJ, Pryor TA, et al. Development and evaluation of a computerized admission diagnoses encoding system. Computers and Biomedical Research 1996;29(1/2):351-72.
[3] Department of Health and Human Services, Washington. International Classification of Diseases, Clinical Modifications, 9th revision, 1980.
[4] Côté RA, Rothwell DJ, Palotay JL, Beckett RS, and Brochu L, eds. The Systematised Nomenclature of Human and Veterinary Medicine: SNOMED International. College of American Pathologists, Northfield, 1993.
[5] Oliver DE and Altman RB. Extraction of SNOMED concepts from medical record texts. In: Proceedings of the 18th Annual SCAMC, Washington. McGraw-Hill, 1994:179-83.

[6] Sager N, Nhan NT, Lyman M, and Tick LJ. Medical language processing with SGML display. In: Cimino [43], October 1996:547-51.
[7] Zingmond D and Lenert LA. Monitoring free-text data using medical language processing. Computers and Biomedical Research 1993;26:467-81.
[8] Friedman C, Alderson PO, Austin JH, Cimino JJ, and Johnson SB. A general natural-language text processor for clinical radiology. Journal of the American Medical Informatics Association 1994;1(2):161-74.
[9] Rector AL, Nowlan WA, and Kay S. Conceptual knowledge: the core of medical information systems. In: Lun KC, Degoulet P, Piemme T, and Rienhoff O, eds, Proc MEDINFO 92, Geneva. North Holland, 1992:1420-6.
[10] Zweigenbaum P and Consortium Menelas. Menelas: an access system for medical records using natural language. Computer Methods and Programs in Biomedicine 1994;45:117-20.
[11] Nygren E and Henriksson P. Reading the medical record. I. Analysis of physicians' ways of reading the medical record. Computer Methods and Programs in Biomedicine 1992;39:1-2.
[12] Seroussi B, Baud R, Moens M, et al. DOME final report. DOME (EU Project MLAP-63221) Deliverable D0.2, DIAM, SIM/AP-HP, 1996.
[13] Nygren E, Johnson M, and Henriksson P. Reading the medical record. II. Design of a human-computer interface for basic reading of computerized medical records. Computer Methods and Programs in Biomedicine 1992;39:13-25.
[14] Adelhard K, Eckel R, Holzel D, and Tretter W. A prototype of a computerized patient record. Computer Methods and Programs in Biomedicine 1995;48:115-9.
[15] Bouaud J and Seroussi B. Navigating through a document-centered computer-based patient record: a mock-up based on WWW technology. In: Cimino [43], October 1996.
[16] Carvalho GA. Modélisation documentaire du dossier patient : une expérience. Rapport de DEA, Informatique Biomédicale, Université Paris 5, 1996.
[17] Laforest F, Frenot S, and Flory A. A new approach for hypermedia medical records management. In: Brender J, Christensen JP, Scherrer JR, and McNair P, eds, Proceedings of MIE'96, Copenhagen, DK. IOS Press, 1996:1042-6.

[18] Bachimont B and Charlet J. Hospitexte : station de consultation pour un dossier patient dématérialisé. Technical Report 96-170, Département Intelligence Artificielle et Médecine, SIM/AP-HP, 91, bd de l'Hôpital, 75634 Paris cedex 13, France, 1996.
[19] Armstrong S. Using Large Corpora. MIT Press, 1994. Reprinted from Computational Linguistics, 19(1/2), 1993.
[20] Lehnert W and Sundheim B. A performance evaluation of text-analysis technologies. The Artificial Intelligence Magazine 1991(Fall):81-94.
[21] Brill E. Transformation-based error-driven learning and natural language processing: a case study in part-of-speech tagging. Computational Linguistics 1995;21(4):543-65.
[22] Basili R, Pazienza MT, and Velardi P. What can be learned from raw texts? Machine Translation 1993;8:147-73.
[23] ISO. Standard generalized markup language. ISO 8879, International Organization for Standardization, 1 rue de Varembé, C.P. 56, CH-1211 Genève, 1986.
[24] Goldfarb CF. The SGML Handbook. Oxford University Press, 1990.
[25] Sperberg-McQueen CM and Burnard L, eds. Guidelines for Electronic Text Encoding and Interchange (TEI P3), Chicago, Oxford, 1994. ACH, ACL and ALLC.
[26] Alschuler L, Lincoln TL, and Spinosa J. Medicine for SGML. In: Proceedings GCA SGML'96, 1996:181-9.
[27] Kay M. Algorithm schemata and data structure in syntactic processing. In: Grosz BJ, Jones KS, and Webber BL, eds, Readings in Natural Language Processing. Morgan Kaufmann, Los Altos, Ca., 1986:35-70.
[28] Nii HP. Blackboard systems: the blackboard model of problem solving and the evolution of blackboard architectures. The Artificial Intelligence Magazine 1986;7(2,3):38-53,82-106.
[29] de Kleer J. An assumption-based truth maintenance system. Artificial Intelligence 1986;28(1).
[30] Langendoen DT and Simons GF. A rationale for the TEI recommendations for feature-structure markup. In: Ide and Véronis [44], 1995:191-209. Reprinted from Computers and the Humanities 29, 1995.

[31] Baud RH, Rassinoux AM, Wagner JC, et al. Representing clinical narratives by Sowa's Conceptual Graphs. Methods of Information in Medicine 1995;34(1/2).

[32] DeRose SJ and Durand DG. Making Hypermedia Work: A User's Guide to HyTime. Kluwer Academic Publishers, 1994.
[33] SoftQuad, Inc., Toronto. Panorama Pro for Microsoft Windows, 1995.
[34] Véronis J. MULTEXT home page. WWW page http://www.lpl.univ-aix.fr/projects/multext/, CNRS et Université de Provence, 1996.
[35] ISO. Document style semantics and specification language. ISO/IEC 10179, International Organization for Standardization, 1 rue de Varembé, C.P. 56, CH-1211 Genève, 1995. See also [45].
[36] Bachimont B. Intelligence artificielle et écriture dynamique : de la raison graphique à la raison computationnelle. In: Colloque de Cerisy-la-Salle "Au nom du sens" autour de et avec Umberto Eco. Grasset, Paris, 1996. To appear.
[37] Natural Language Software Registry. The natural language software registry online research server. WWW page http://cl-www.dfki.uni-sb.de/cl/registry/draft.html, 1996.
[38] European Language Resources Association. ELRA home page. WWW page http://www.icp.grenet.fr/ELRA/home.html, 1996.
[39] Friedman C and Hripcsak G. Evaluating natural language processors in the clinical domain. In: Chute CC, ed, Proc Conference on Natural Language and Medical Concept Representation, Jacksonville, Fl. IMIA WG6, 1997:41-52.
[40] Giordano R. The TEI header and the documentation of electronic texts. In: Ide and Véronis [44], 1995:75-84. Reprinted from Computers and the Humanities 29, 1995.
[41] Cimino JJ, Socratous SA, and Grewal R. The informatics superhighway: prototyping on the World Wide Web. In: Gardner RM, ed, Proceedings of the 19th Annual SCAMC, New Orleans. AMIA (JAMIA supplement), 1995:111-5.
[42] Humphreys BL and Lindberg DAB. Building the Unified Medical Language System. In: Proceedings of the 13th Annual SCAMC, Washington. IEEE, 1989:475-80.
[43] Cimino JJ, ed. Proceedings of the AMIA Annual Fall Symposium 96, Washington DC, October 1996. AMIA (JAMIA supplement).
[44] Ide N and Véronis J, eds. The Text Encoding Initiative. Kluwer Academic Publishers, Boston, 1995. Reprinted from Computers and the Humanities 29, 1995.
[45] Clark J. ISO/IEC DIS 10179.2: Document style semantics and specification language (DSSSL). WWW page http://www.jclark.com/dsssl/, 1995.