integration and XML-based text/data retrieval are combined .... TIPSTER architecture [7], each tag information is ..... General Architecture for Text Engineering,.
XML Tag Information Management System – A Workbench for Ontology-based Knowledge Acquisition and Integration Hideki Mima1, Sophia Ananiadou2, Goran Nenadic2, Junichi Tsujii1 1
Dept. of Information Science, University of Tokyo Hongo 7-3-1, Bunkyo-ku, Tokyo 113-0033, Japan {mima, tsujii}@is.s.u-tokyo.ac.jp 2
Computer Science, University of Salford Newton Building, Salford M5 4WT, UK {S.Ananiadou, G.Nenadic}@salford.ac.uk
Abstract In this paper, we propose an integrated information management system in which ontology-based knowledge integration and XML-based text/data retrieval are combined using tag information and ontology management tools. The main purpose of the system is to implement a query answering system for XML-based documents in the domain of molecular biology. The aim is to provide efficient access to heterogeneous biological and genomic text ual data and databases , enabling users to use a wide range of textual and non-textual resources effortlessly. Our system includes automatic term recognition, context-based automatic term clustering, ontology-based inference, and intelligent tag information retrieval. Through further description of the implementation and architecture of the system, we will demonstrate how the system can accelerate knowledge integration on the textual and non-textual resources even for novice users of the field. Keywords knowledge acquisition, automatic term recognition, term clustering, interval operation, ontology
1.
Introduction
With the recent dramatic increase of importance of electronic communication and data-sharing over the Internet, there exists an increasingly growing number and amount of publicly accessible knowledge sources, both in the form of documents and fact data bases.
These knowledge sources (KSs) are intrinsically heterogeneous and dynamic. They are heterogeneous since they are autonomously developed and maintained by independent organizations with different purposes in mind. They are dynamic since constantly new information is being revised, added and removed. Such heterogeneous and dynamic nature of KSs imposes challenges on systems that help users to locate information relevant to their needs. In attempt to handle such dynamic nature of knowledge sources efficiently, several efforts have been proposed based on knowledge acquisition (KA) [13] and knowledge integration (KI) framework [17]. Further, the Semantic Web framework [1] has been proposed in order to link relevant Web resources using Resource Description Framework (RDF) [2] and an ontology. However, although the semantic web framework is powerful in expressing content of resources to be semantically retrieved, some manual description is expected using the RDF/ontology. In addition , no further answer to the ontology conflictions/mismatches [16] is provided. An additional problem for efficient and consistent knowledge acquisition and integration is the lack of clear naming conventions in the domain of molecular biology, which necessitates the automated handling of terminological variations. However, recent solutions such as [8] cover only syntactic and morpho-syntactic variations. Currently, there are no proposals for the proper handling of semantic variations. In this paper, we propose an integrated information management system in which ontology-based knowledge integration and XML-based text/data retrieval are combined using tag information and ontology management tools. The management of knowledge resources, similarly to the Semantic Web, is based on XML, RDF, and ontology-based inference.
However, our aim is to facilitate the KA and KI tasks not only by using manually defined resource descriptions, but also by exploit ing NLP techniques such as automatic term recognition (ATR) and automatic term clustering (ATC). ATR and ATC can be used for semantic term variation by automatically and uniformly organizing knowledge resources. The ideas that will be presented are implemented in the domain of molecula r biology. Currently, the size of knowledge in that domain is increasing so rapidly that it is impossible for any domain expert to assimilate the new knowledge without automated help. Vast amounts of knowledge still remain unexplored and this poses a major handicap to this knowledge intensive discipline. Our system aims to help biologists to systematically gather data crucial for their research. The paper is organized as follows: in the next section, we present the overall architecture of the system and briefly describe the components incorporated in the system, while section 3 presents the proposed method for knowledge integration and management. In section 4 we present some results and evaluation, while section 5 concludes the paper.
2.
TIMS – system architecture
In order to facilitate an efficient integration and organization of KSs, we have developed an XML-based Tag Information Management System (TIMS). The main purpose of the TIMS workbench is to implement a knowledge integration tool and a query answering system for XML-based documents in the domain of molecular biology. TIMS integrates the following modules via XML-based communication: the JTAG annotation tool, the ATRACT automatic term recognition/clustering tool and the LiLFeS abstract machine. Figure 1 shows the system architecture.
2.1 JTAG JTAG is an XML-based manual annotation and resource description aid tool. A purpose of the tool is to support manual annotation, such as tagging, adjusting term recognition results, developing RDF logic, etc. All the annotation can be achieved via GUI. In addition, ontology information described in XML can also be developed and modified using the tool.
ATRACT Automatic Term Recognition and Term Clustering
Document/ Database Retriever XML data XML data
LiLFeS Syntactic and Semantic Parse r / RDF and Ontology Manager
XML data
JTAG
XML Data XML Data Retrieval Management
Manual Resource Description Aid Interface XML data
TIMS
Tag Information Database
Figure 1. System architecture of TIMS
2.2 ATRACT : A Terminology Workbench In the domain of molecular biology, there is an increasing amount of new terms that represent newly created concepts. Existing term dictionaries cannot cover the needs of specialists, and therefore automatic term extraction tools are important for consistent term discovery. ATRACT [11] is a terminology workbench that integrates automatic term recognition (ATR) and automatic term clustering (ATC). Its main aim is to help biologist to gather and organize terminology in the domain. Our ATR method is based on the C/NC-value method [3]. It dynamically recognizes terms in texts, which often include unknown, unregistered words and/or new term candidates. This method is a hybrid approach, combining linguistic knowledge (term formation patterns) and statistics (frequency of occurrence, term length, etc). Also, acronym acquisition and resolut ion are incorporated in the ATR process. The ATR module retrieves terms on the fly and sends the results as XML tag information to the TIMS workbench. Besides term recognition, term clustering is an indispensable component in a knowledge organization process. In molecular biology, terminological opacity and variation are very common. Therefore, term clustering is essential for semantic integration of terms, construction of domain ontology and for choosing appropriate semantic tag information. The automatic term clustering (ATC) method is based on Ushioda’s AMI (Average Mutual Information)- hierarchical clustering method [18]. Our implementation uses parallel symmetric processing for high speed clustering and is built on the C/NC-value results. As input, we use co-occurrences of automatically recognized terms and their contexts and the output is a dendrogram of hierarchical term clusters (like thesaurus). The calculated term cluster information is stored in LiLFeS
(see below) and combined with a predefined ontology according to the term classes automatically assigned.
Part-of-speech tags
2.3 LiLFeS
Tag name
start
end
...
NOUN
5
10
…
VERB
15
26
…
ADJ
100
110
DNA
140
Tag name
LiLFeS [14] is a Prolog-like programming language and language processor used for defining definite clause programs with typed feature structures. Since typed feature structures can be used like first order terms in Prolog, the LiLFeS language can describe various kinds of applications based on feature structures. Examples include HPSG parsers, HPSG-based grammars and compilers from HPSG to CFG. Furthermore, other NLP systems can be easily developed because feature structure processing can be directly written in the LiLFeS language. In the TIMS workbench, LiLFeS is used to 1) infer relevance between terms using type hierarchy processing, and 2) parse sentences using HPSG-based parsers and convert the results into an XML-based formalism. Figure 2 shows the model of knowledge integration using TIMS and LiLFeS. LiLFeS plays a central role in ontology-based inference, such as logic inference and hierarchical matching of classes using term ontology information.
Expression Model
Processing Model
GUI Layer
TIMS
XML (F-Structure) Layer
RDF (Logic) Layer
LiLFeS
Term Ontology Layer
Figure 2. Model of knowledge integration in TIMS workbench
3.
Knowledge Integration and Management in TIMS
The knowledge integration and management in TIMS are organized by integrating XML-data management and tag- and ontology-based information retrieval.
3.1 The XML tag data management In an attempt to accumulate a large amount of linguistic information for accelerating language processing, several linguistic tag systems have been proposed, such
RNA …..
Syntactic tags
NP
203
209
5
…
… …
end
...
…
10
…
…
…
VP
15
35
…
PP
35
100
…
VP
100
150
…
203 start
209 end
…
...
18
…
…
NP
Tag name
….. DNA
Semantic tags
150
start
…
5
…
PROTEIN
40
80
…
DNA
160
180
…
DNA
200
220
…
RNA
240
260
…
…..
…
…
…
Figure 3. Tag data management in TIMS
as GDA [5], GENIA [6] or GATE [4]. In general, since tagged information is embedded in texts using HTML-like tags, multiple tag information can reduce the visual transparency of the tagged documents. In addition, when we allow the tagging scheme to include structural complexity (e.g. nesting, candidate combinations of syntax and semantic structure, etc.), the overhead of processing the tag information causes inefficient corpus processing. Communication within the TIMS workbench is based on XML-data exchange. TIMS initially parses the XML documents and de-tags them. Then, like in TIPSTER architecture [7], each tag information is stored separately from the original documents and managed by an external database software. This facility allows, as shown in figure 3, different types of tags for the same document to be added.
3.2 Tag- and ontology-based information retrieval Another key feature of TIMS is its facility to search logically for tags, which is implemented via ‘interval operations’. Interval operations (IO) are XML specific text/data retrieval operations, which can be performed over TIMS documents. These operations can be performed over the specified documents and/or tag sets (e.g. syntactic tags, semantic tags, etc.) simultaneously or in batch mode, selecting the documents/tag sets from a list. This accelerate s the process of KA, as users are able to retrieve information from multiple KSs simultaneously.
The main notion of interval operations is that the XML tags are assumed as information that specifies certain intervals in documents. Each interval operation takes two sets of intervals as input and returns a set of intervals according to the specified logical operations. Currently, we define four types of logical operations: • Intersection ‘⊗’ returns intersected intervals of all the intervals given. • Union ‘⊕’ returns merged intervals of all the intersected intervals. • Subtraction ‘y’ returns differences in intervals of all the intersected intervals. • Concatenation ‘+’ returns concatenated intervals of all the continuous intervals. For example, suppose X denotes a set of intervals of manually annotated tags for a document and Y denotes a set of intervals of automatically annotated tags for the same document. The interval operation 1 ((X⊗Y) ⊕{X∪Y}) results in the differences between human and machine annotations (see figure 4).
X ={
}
Y={
}
X⊗Y = {
(TIQL). Using this language , a user can specify the interval operations to be performed on selected documents. The basic expression in TIQL has the following form: SELECT [n-tuple variables] FROM [XML document(s)] WHERE [interval operation] FROM [XML document(s)] WHERE [interval operation] …… where, [n-tuple variables] specifie s table format as output, [XML document(s)] denotes a document to be operated on, and [interval operation] denotes an interval operation to be performed over the corresponding document with variables of each interval to be bound. For example, a typical KA task in our system is achieved by specifying an expression like the following:2 SELECT x1, x2 FROM “paper-1.xml” WHERE ⊗{x1:∪x2:} FROM “paper-2.xml” WHERE ⊗{x1:∪x2:} In this example, TIMS, as shown in figure 5, extracts all the hierarchically subordinate classes matched to
}
⊕{X∪Y}={ nucleic_acid
} EVENT
(X⊗Y) ⊕{X∪Y}={
} paper-2.xml Title Background
Figure 4. (X⊗ Y) ⊕ {X ∪Y}
........ androgen receptor gene
Interval operations are powerful means for data mining from different sources using tag information. In addition, LiLFeS enables tag (interval) retrieval to process not only regular pattern/string matching, but also ontological hierarchy matching to subordinate classes using either predefined or automatically derived term ontology. Thus, semantic similarity based tag information retrieval is achieved. In order to integrate and expand the above components, we have developed a tag information query language
1
‘∪’ denotes a merged set of all the elements.
............... paper-1.xml
EVENT activate bind ...
nucleic acid androgen receptor gene acidmRNA HB-EGF
...
Title Background
........ HB-EGF mRNA
...............
Figure 5. Ontology-based Tagged Information Retrieval
2
‘* ‘ denotes hierarchical matching.
(, ) pair from specified KSs, and then automatically builds a table according to n-tuple order to display the results.
4.
Evaluation
We conducted preliminary experiments on the proposed workbench design. In this paper, we focus on the quality of automatic term recognition and similarity measure calculation via automatically clustered terms to show the feasibility of our proposed term-oriented KA. In addition, we examine the performance of tag manipulation in TIMS compared to string-based XML tag manipulation to show the advantage of the tag information management scheme. The evaluation was performed on the NACSIS AI-domain corpus [9], which includes 1800 abstracts for term recognition. For term clustering and tag manipulation performance, we used GENIA corpus [6] which include s 1,000 MEDLINE abstracts [10], with overall 40,000 (16,000 distinct) semantic tags annotated for terms in the domain of nuclear receptors. We have examined the performance of the C/NC-value method with respect to its precision. The results are shown in figure 6. In the results, we have observed that C-value increases the precision for the first 3 intervals by 5 %, and NC-value by 7 %. They are noteworthy that the precisions of the top 100 interval are more than 93 % for both methods, which are high in quality. The above results show that C/NC-value produces more real terms than pure frequency of occurrence placing them closer to the top of the extracted candidate list. Detailed evaluation for the C/NC-value method term recognition is also presented in [12].
We have used the similarity measure calculation as the central computing mechanism for inferring relevance between XML tags and tags specified in TIQL/interval operation , determining the most relevant tags in the XML-based KS(s). The clustered terms were obtained by using Ushioda's AMI-based hierarchical clustering algorithm [18]. As training data, we have used 1,000 MEDLINE abstracts. Similarities between terms were calculated according to the hierarchy of the clustered terms. In this experiment, we have adopted a semantic similarity calculation method for measuring the similarity between terms described in [15]. Three major sets of classes (namely, nucleic_acid , amino_acid, SOURCE) of manually classified terms from GENIA corpus [6] were used to calculate the average similarities (AS) of the elements with every possible combination of the sets, that is, AS( X , Y ) =
1
0.8
0.6
NC-value
0.5
C-value
0.4
Frequency
0.3 0.2 0.1
-90 0
-10 00
-80 0
-70 0
-60 0
-50 0
-40 0
-30 0
-20 0
0 -10 0
Precision
0.7
Interval
Figure 6. NC-value vs. C-value vs. Freq.
∑
sim ( x , y )
x∈ X , y∈ Y , x ≠ y
where X, Y indicate each set of the classified terms, sim(x,y) indicates similarity between terms x and y, and n indicates the number of possible combinations of terms in X and Y (except in the case where x = y). As the table 1 shows, and as we expected, each AS between the same classes, i.e. AS(nucleic_acid, nucleic_acid ), AS(amino_acid, amino_acid), AS(SOURCE, SOURCE), was greater than the others respectively. We believe that this is feasible enough to use automatic clustered terms as the main source of knowledge for calculating similarities between terms . Table 1. Average Similarities nucleic_acid amino_acid
SOURCE
0.9
1 n
nucleic acid
amino acid
SOURCE
#. of terms
0.498 0.396 0.390
0.492 0.388
0.480
3108 4284 3302
In order to examine tag manipulation performance of TIMS, we measured the processing times consumed for executing an interval operation in TIMS compared to time needed by using string-based regular expression matching (REM). We focused on measuring the interval operation ‘⊗’ with intervals (tags) ‘’ and ‘’ (cf. [6]). The average TIMS processing was about 40-80% faster (depending on number of tags and corpus length) than that of REM. Therefore, we assume that the TIMS tag information management scheme can be seen as an efficient mechanism to facilitate knowledge acquisition process.
5.
Conclusion
In this paper, we examined some problems dealing with IR over large KSs. We have proposed an XML-based integrated KA aid system where we combined tag data management and ontology-based information retrieval. The TIMS workbench allows users to search and combine information from various sources. An important source of information in the system is derived from terminological knowledge; this is provided automatically in XML format where we extract potential terms and term clusters from running text. The system has been tested on 1,000 abstracts of MEDLINE [10]. Additionally, we are conducting experiments using GENIA project’s tag information and ontology. Important areas of future research for the system will involve expanding the scalability of the system to real WWW knowledge acquisition tasks to test and improve the functionality of the system. References [1] Barners-Lee, T. 1998. The Semantic Web as a longuage of logic, http://www.w3.org/DesignIssues/Logic.html [2] Bricklet, D. and Guha, R. 2000. Resource Description Framework (RDF) Schema Specification 1.0, W3C Candidate Recommendation, available at http://www.w3.org/TR/rdf-schema. [3] Frantzi, K. T., Ananiadou, S. and Mima, H. 2000. Automatic Recognition of Multi-Word Terms: the C-value/NC-value method, in International Journal on Digital Libraries, Vol. 3, No. 2, 115—130. [4] GATE, 2001. General Architecture for Text Engineering, http://www.dcs.shef.ac.uk/research/groups/nlp/gate/. [5] GDA, 2001, Global Document Annotation, available at http://www.i-content.org/GDA. [6] GENIA corpus, 2001. GENIA project home page. http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/ [7] Grishman, R. 1995. TIPSTER Phase II Architecture Design Document, New York University, available at http://www.tipster.org/arch.htm.
[8] Jacquemin, C., Tzoukermann, E. 1999. NLP for Term Variant Extraction: A Synergy of Morphology, Lexicon and Syntax. In T. Strzalkowski, editor, Natural Language Information Retrieval, pages 25-74, Kluwer, Boston, M. [9] Koyama, T., Yoshioka , M. and Kageura, K. 1998. The Construction of a Lexically Motivated Corpus - The Problem with Defining Lexical Units. In Proceedings of First International Conference on Language Resources and Evaluation, 1015—1019. Granada, Spain. [10] MEDLINE 2001. National Library of Medicine, http://www.ncbi.nlm.nih.gov/PubMed/ [11] Mima, H., Ananiadou, S., Nenadic, G. 2001a. ATRACT Workbench: An Automatic Term Recognition and Clustering of Terms, in Text, Speech and Dialogue TSD2001,Lecture Notes in AI 2166, Springer Verlag [12] Mima, H. and Ananiadou, S., 2001b. An Application and Evaluation of the C/NC-value Approach for the Automatic term Recognition of Multi-Word units in Japanese, in International Journal on Terminology, Vol. 6(2) pp 175-194. [13] Mima, H., Ananiadou, S. and Tsujii, J. 1999. A Web-based integrated knowledge mining aid system using term-oriented natural language processing. In Proceedings of The 5th Natural Language Processing Pacific Rim Symposium, NLPRS'99, pp. 13—18 [14] Miyao, Y., Makino, T., Torisawa, K. and Tsujii, J. 2000. The LiLFeS abstract machine and its evaluation with the LinGO grammar. Journal of Natural Language Engineering, Cambridge University Press, Vol. 6(1), [15] Oi K., Sumita E. and Iida H. 1997. Document Retrieval Method Using Semantic Similarity and Word Sense Disambiguation (in Japanese). Journal of Natural Language Processing, Vol.4, No.3, pp.51-70. [16] Visser, P.R.S., Jones, D.M., Bench-Capon, T.J.M. and Shave, M.J.R. 1997. An Analysis of Ontology Mismatches; Heterogeneity versus Interoperability, AAAI 1997 Spring Symposium on Ontological Engineering, Stanford University, California, USA. [17] Uramoto N. and Takeda K. 1998. Recent Trends in Information Description and Exchange on the Internet, Journal of Japanese Society for Artificial Intelligence, Vol. 13, No. 4. [18] Ushioda A. 1996. Hierarchical Clustering of Words, In Proc. of COLING ’96, Copenhagen.