Choosing an XML database for linguistically annotated corpora

Richard Eckart

July 30, 2008

Abstract

XML has become the de-facto standard for representing linguistically annotated corpora. It seems safe to assume that storing and querying an XML-encoded, annotated corpus in an XML database is a straightforward procedure. In reality, however, it is not. This article aims to provide guidelines for deciding whether to use an XML database and how to choose a suitable product. To this end we examine the following questions: Which aspects should be considered before choosing to store an XML-encoded annotated corpus in an XML database? Which facilities does a database need to provide in order to be suitable for storing and querying annotated corpora? Do current XML databases offer these facilities, and, if not, can they be added?

1 Introduction

XML [31] has become the de-facto standard for representing linguistically annotated corpora. Most corpora available today can be obtained in an XML format. Prominent XML formats are those defined by the Text Encoding Initiative (TEI; latest version: [8]) and XCES [15]. Upcoming formats are GrAF [16] and PAULA [10]. Furthermore, there are many other XML formats, e.g. TigerXML [17], Nite-XML [5] or MultiX [6]. It may seem safe to assume that storing and querying a corpus encoded in one of these XML formats in an XML database management system (XDBMS) is a straightforward procedure. In reality, however, it is not that simple. This article aims to provide guidelines for deciding whether to use an XDBMS and how to choose a suitable product. To this end we examine the following questions: Which aspects should be considered before choosing to store an XML-encoded annotated corpus in an XDBMS? Which facilities should an XDBMS provide in order to be suitable for storing and querying annotated corpora? Do current XML databases offer these facilities, and, if not, can they be added?

This article does not try to provide guidelines for a comprehensive assessment of XML databases in general. It focusses only on those aspects that are relevant in the context of linguistically annotated corpora. Section 2 examines common expectations towards XML databases. Section 3 points out considerations that should be taken into account before choosing an XML database. Section 4 elaborates which facilities an XML database should provide in order to best support linguistically annotated corpora. Section 5 discusses if and how a selection of current XML databases implement these facilities.

2 Expectations towards using an XML database

Favouring an XDBMS over a traditional relational database management system (RDBMS) comes with certain expectations. Probably the three most prominent ones are: to profit from the widespread use and interoperability of XML technology; to avoid a data model transformation from the annotation model to the database model, both being the XML model; and consequently to be able to apply XPath [26, 33], XQuery [34] and other XML-related technology to the annotated corpus in a straightforward manner.

2.1 Profit from widespread use and interoperability

Expectation 1: A major expectation when encoding annotated corpora in XML is probably to profit from the widespread support for XML and the associated W3C recommendations, such as XPath, XQuery, XSLT [27] or XML Schema [29]. These provide the basis for interoperability between different XML-related tools. The widespread use of XML suggests that stable and performant implementations exist. Furthermore, XML has become so popular and pervasive that adoption of an XML format with an associated schema is often regarded as a first step towards the long-term preservation of data. A platform based on XML and the associated web standards is expected to inherit these characteristics of stability, performance and sustainability (cf. [7, 14]).

2.2 Avoid data-model transformation

Expectation 2: Another major motivation is probably the desire to avoid a model transformation between the annotation model and the database model. Databases are defined as a layered architecture with a clear separation between the external, conceptual and internal model [25]. The external model is application-specific. In the case of annotated corpora, the external model (the annotation model) is defined by the annotated phenomena (features) and the primary data type (signal type; text, audio, etc.). The conceptual model is the model exposed by the database. In the case of a relational database this is the relational model, while for an XML database it is the XML model. The query algebra of the database is defined on the conceptual model. The internal model is used internally by the database to physically persist data and to support efficient searching.

2.3 Easily apply existing XML-tools

Expectation 3: Yet another expectation is that queries become very straightforward when the external and the conceptual model are identical, because the query algebra then operates directly on the annotation model, both being XML. Similarly, one expects that tools such as XPath, XQuery or XSLT can be applied to the annotated corpus without any further complications. Ideally, these tools should apply so naturally that a linguist user can immediately employ them, e.g., to query and explore a corpus. To allow this, a suitable mapping between annotation constructs and XML constructs is required. A simple mapping can be:

● annotation elements: define the structure of the annotation, e.g. a constituent hierarchy; mapped as XML elements

● features: information, e.g. parts-of-speech or syntactic categories, that is attached to an annotation element to further define it; mapped as XML attributes

● signal: the primary data that is being annotated; mapped as XML text nodes
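The mapping listed above can be illustrated with a minimal sketch. It assumes Python's standard xml.etree.ElementTree module; the nested-tuple representation of the annotation and the names sentence, token, id and pos are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

def build(node):
    """Map one annotation node (name, features, content) to an XML element."""
    name, features, content = node
    elem = ET.Element(name, features)   # annotation element -> XML element, features -> attributes
    if isinstance(content, str):
        elem.text = content             # signal -> XML text node
    else:
        for child in content:           # nested annotation elements -> child elements
            elem.append(build(child))
    return elem

# Hypothetical annotation: one sentence with three part-of-speech tagged tokens.
annotation = ("sentence", {"id": "s1"}, [
    ("token", {"pos": "DT"}, "The"),
    ("token", {"pos": "NN"}, "fox"),
    ("token", {"pos": "VBZ"}, "jumps"),
])

root = build(annotation)
xml_text = ET.tostring(root, encoding="unicode")

# With this mapping a feature constraint is a plain attribute predicate.
nouns = [t.text for t in root.findall("token[@pos='NN']")]
```

Under this mapping, a query for all nouns needs nothing beyond an attribute predicate, which is the kind of directness expectation 3 describes.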

3 Aspects to consider

This section examines some aspects to consider before storing a linguistically annotated corpus in an XML database and relates them to the expectations stated previously.

3.1 Document-driven XML vs. data-driven XML

An important criterion for choosing an XML database should be whether or not the annotations are encoded in a document-oriented manner. Document-oriented XML is an inline approach inserting annotations in the form of XML tags into a document (cf. figure 1). An annotated document is usually easily human-readable and meaningful even without the annotations. The relative order of annotations is always important.

The quick brown fox jumps over the lazy dog.

Figure 1: Example of document-oriented inline XML

Data-oriented XML (cf. figure 2) is geared towards machine processing and functions like a database record. When the markup is removed from the document, the result is often useless. The order of elements is often meaningless. Data-oriented XML can be used to encode complex structures as is done, e.g., in GrAF, TigerXML or PAULA.

Figure 2: Example of data-oriented XML

The following sections will argue that the expectations given in section 2 can be met for simple document-oriented annotations, and that more concessions need to be made the more data-oriented constructs are used.

3.2 Preservation of information

XML databases should be able to capture all eleven XML information items defined in the XML Infoset [30] (footnote 1). Before choosing a non-XML database as an alternative, the preservation of this information needs to be considered. Effectively this means that a function that projects the XML Infoset into the target database is required. If the corpus database is limited to read-only access, only the information that needs to be queryable has to be exported. The mapping function may be injective. If the database should offer read-write access, the function needs to be bijective. If items, such as comments or processing instructions, are never used, the mapping may ignore them.
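A mapping that ignores some information items can be illustrated with a small sketch. It assumes Python's standard xml.etree.ElementTree module, whose default parser discards comments; the two sample documents and the function name project are hypothetical.

```python
import xml.etree.ElementTree as ET

# Two hypothetical documents that differ only in a comment information item.
doc_a = "<s><!-- checked by annotator 2 --><token pos='NN'>fox</token></s>"
doc_b = "<s><token pos='NN'>fox</token></s>"

def project(xml_text):
    """Parse and re-serialise a document; the default parser drops comments."""
    return ET.tostring(ET.fromstring(xml_text), encoding="unicode")

stored_a = project(doc_a)
stored_b = project(doc_b)

# Both documents collapse to the same stored form, so the original
# cannot be reconstructed from the stored one.
same = stored_a == stored_b
```

Such a projection is unproblematic as long as comments are never used in the corpus. If they carry information and the store must be written back to XML, they have to be mapped as well.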

3.3 Expressiveness of the XML data model

The limited expressiveness of the XML data model with respect to the constructs commonly used for annotation tasks needs to be thoroughly considered before adopting the XML database approach. This section examines some layers of indirection that can be introduced to sidestep these limitations. Unfortunately, each of these indirections takes the format further away from document-oriented XML towards data-oriented XML.

Footnote 1: Information items: document, element, attribute, processing instruction, unexpanded entity reference, character, comment, document type declaration, unparsed entity, notation and namespace.

3.3.1 Stand-off annotation

Using a stand-off approach (cf. [23]), the signal and the annotation are kept separate from each other (figure 3). The annotation is linked to the signal using a number of anchors (in the example the attributes s and e). Each anchor addresses a point within the signal and together they form a segment that identifies a region of the signal.

Signal: 'The quick brown fox jumps over the lazy dog.'

Figure 3: Example of document-oriented stand-off XML

A stand-off approach has a number of benefits:

● Multiple structures, i.e. multiple XML documents, can annotate the same signal.

● Segments can address any kind of data; signals are no longer limited to text.

● Overlapping regions of a signal can be annotated.

● Non-continuous annotations can be merged by attaching multiple segments addressing non-continuous regions of the signal to a single annotation element.

● Crossing edges can be uncrossed by pushing them into the linking between signal and annotation. The use of segments allows the annotation elements to be reordered without affecting the signal.

But the approach also has a number of drawbacks:

● Additional effort is necessary to resolve segments to actual signal content.

● Reordering the annotation elements, e.g. to avoid crossing edges, causes the order of the annotation elements to lose its meaning. An element A may be followed by an element B, but the annotated signal data(A) may actually occur before data(B). The original order may be preserved as a dedicated attribute, but this complicates processing the annotations (e.g. in queries).
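Both drawbacks can be made concrete with a short sketch. It assumes that the anchors s and e are character offsets into the signal (start inclusive, end exclusive); the element names np and vp and the offsets are hypothetical and are not taken from figure 3.

```python
import xml.etree.ElementTree as ET

signal = "The quick brown fox jumps over the lazy dog."

# Hypothetical stand-off layer; the elements are deliberately not in signal order.
layer = ET.fromstring(
    "<annotation>"
    "<np s='31' e='43'/>"   # 'the lazy dog'
    "<np s='0' e='19'/>"    # 'The quick brown fox'
    "<vp s='20' e='43'/>"   # 'jumps over the lazy dog'
    "</annotation>"
)

def resolve(elem):
    """Resolve the segment of an annotation element to its signal content."""
    return signal[int(elem.get("s")):int(elem.get("e"))]

# Order of the annotation elements in the XML document.
in_xml_order = [resolve(e) for e in layer]

# Order of the annotated regions in the signal, recovered from the anchors.
in_signal_order = [resolve(e) for e in sorted(layer, key=lambda e: int(e.get("s")))]
```

Every access to the annotated text goes through resolve, and recovering the signal order requires an explicit sort on the anchors, because the document order of the elements no longer carries that information.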

3.3.2 Labelled edges

Some annotation models, e.g. TigerXML, allow the attachment of features to the edges between annotation elements. Edges, however, are not part of the XML Infoset; they are only implicitly present. XPath exposes edges in the form of axes. One can, for example, navigate along the edge connecting an element to its parent element along the parent axis, but it is not possible to address this edge and to attach an attribute to it. It is possible to explicate edges, e.g. using XLink [28]. This approach encodes edges as XML elements decorated with a set of defined attributes. However, it significantly complicates the application of XPath, XQuery and XSLT, because these do not support the navigation along XLinks in a manner as natural as the navigation along axes.

The brown fox is quick. He jumps over the lazy dog.

Figure 4: Example of a labeled edge using XLink

Figure 4 shows an anaphora link of the type pronominalization from He to fox. One would like to be able to navigate across XLinks as if they were axes, as illustrated in the XPath statement in figure 5, which is intended to mean find those tokens pronominalised as 'he' later on. The XLink edge is used here as if it were the name of an axis. To select the kind of edge, it is qualified using a node-test on the type attribute. Unfortunately, neither XPath nor XQuery integrates XLinks in such a manner.

    //token[lower-case(.)='he']/edge[@type='anaphora/pronominalisation']::token

Figure 5: Illustration of an integration of XLink into XPath as an axis

3.3.3 Structured attributes

XML attributes may not have sub-attributes, which prevents them from being used to encode structured features. This, however, is an often desired property of an annotation model. Thus, e.g., TEI-P5 uses an approach that encodes feature structures using XML elements (figure 6). This removes the clear mapping of annotation elements to XML elements and of features to XML attributes that was suggested previously (cf. section 2.3). Consequently it complicates the application of XPath, XQuery and XSLT, because extra care needs to be taken to exclude feature structures, e.g., when counting a constituent's immediate sub-constituents, which are represented by its child elements.

Figure 6: Example of a TEI-P5 feature structure

3.4 Willingness to compromise

Finally, one has to consider whether an XDBMS offers any facilities (cf. section 4) that are worth making concessions with respect to the expectations one may initially have had. The previous sections have shown methods of creating an annotation model on top of XML that offers greater expressiveness than the XML model itself. However, these methods also make the format more data-oriented, and consequently using XPath or XQuery gets more complicated, as they no longer apply directly to the annotation model.

XPath:

    //token[
      @xlink:label=//edge[
        @xlink:arcrole='anaphora/pronominalisation' and
        @xlink:from=//token[lower-case(.)='he']/@xlink:label
      ]/@xlink:to
    ]

XQuery:

    for $edge in //edge, $from in //token, $to in //token
    where $edge/@xlink:arcrole = 'anaphora/pronominalisation' and
          $from/@xlink:label   = $edge/@xlink:from and
          $to/@xlink:label     = $edge/@xlink:to and
          lower-case($from)    = 'he'
    return $to

Figure 7: XPath vs. XQuery using where

The usage of stand-off annotation, XLinks or feature structures introduces differences between the conceptual model (the XML model) and the external model (the annotation model), thus violating expectations 2 and 3. Extending the XPath/XQuery model itself, as for example MonetDB/XQuery [2] does by introducing new axes to navigate between annotation elements using their stand-off anchors, makes the queries incompatible with existing XML tools and thus violates expectation 1.
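The join that explicit edges force on a query can also be sketched outside XQuery. The following Python example assumes the standard xml.etree.ElementTree module and a hypothetical encoding of the figure 4 example in which tokens carry xlink:label and an edge element carries xlink:arcrole, xlink:from and xlink:to; the root element name and the label values are invented for illustration.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"

# Hypothetical encoding of the anaphora link from 'He' to 'fox'.
doc = ET.fromstring(f"""
<text xmlns:xlink="{XLINK}">
  <token xlink:label="t3">fox</token>
  <token xlink:label="t6">He</token>
  <edge xlink:arcrole="anaphora/pronominalisation" xlink:from="t6" xlink:to="t3"/>
</text>
""")

def attr(elem, name):
    """Read an attribute from the XLink namespace."""
    return elem.get(f"{{{XLINK}}}{name}")

# Edges are ordinary elements, so following one means a lookup by label.
by_label = {attr(t, "label"): t for t in doc.iter("token")}

targets = [
    by_label[attr(e, "to")].text
    for e in doc.iter("edge")
    if attr(e, "arcrole") == "anaphora/pronominalisation"
    and by_label[attr(e, "from")].text.lower() == "he"
]
```

The lookup table by_label plays the role that an axis would play if edges were part of the data model. The query has to reconstruct the link by comparing attribute values instead of navigating it.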

4 Facilities

Once a choice for an XDBMS has been made, there are several facilities that a particular product should provide in order to operate well on annotated corpora. To meet expectation 1, the product should be XQuery-compliant. To be performant, it should make use of indexes on the annotation structure as well as on the signal. Finally, support for stand-off annotations should be already provided or should be implementable in XQuery, since this solves the most pressing issues of document-oriented approaches.

4.1 XQuery support

In order to fulfil expectation 1, an XML database should comply with the XQuery recommendation, which is the de-facto standard query language for XML databases. Compliance is usually tested using the official W3C XML Query Test Suite [32], but there are also various other benchmarks available (cf. [1]). In the context of annotations, especially desirable capabilities of an XQuery engine are optimised where clauses, support for the horizontal axes and user-definable functions.

4.1.1 Optimised where

XQuery offers two ways of expressing query constraints. The first is using a node test. The second is the where clause. While many queries can be formulated using only node tests, they can usually be formulated in a more user-friendly manner using a where clause. Figure 7 illustrates the query find those tokens pronominalised as 'he' later on, based on the example data from figure 4, once in XPath and once in XQuery using where. Both queries are complex, but the second one is arguably more user-friendly. Comparing these queries to the more concise query suggested in figure 5, which treats the XLink as an axis, suggests that XPath and XQuery can still be improved.

4.1.2 Horizontal axes

While XQuery is mostly a superset of XPath 1, some lesser-used axes have become optional (footnote 2). For annotated corpora, especially the horizontal axes (following, following-sibling, preceding, preceding-sibling) are important as they can be used to navigate between elements in document order, most prominently between tokens. It is also important that these axes are implemented in a performant way because queries involving sequences of tokens (or other constituents) are very common (figure 8).

    //token[@pos='DT'][following::token[1][@pos='NN']]

Figure 8: Find a determiner followed by a noun using the following axis
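What the query in figure 8 computes can be sketched procedurally. The example assumes Python's standard xml.etree.ElementTree module, which does not implement the following axis, so the tokens are collected in document order and adjacent pairs are compared. It also assumes that token elements are not nested, in which case the next token in document order is the one that following::token[1] selects. The sample document is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical corpus fragment with two sentences.
doc = ET.fromstring(
    "<text>"
    "<s><token pos='DT'>The</token><token pos='JJ'>lazy</token><token pos='NN'>dog</token></s>"
    "<s><token pos='DT'>a</token><token pos='NN'>fox</token></s>"
    "</text>"
)

# iter() yields elements in document order.
tokens = list(doc.iter("token"))

# A determiner whose immediately following token is a noun.
hits = [
    cur.text
    for cur, nxt in zip(tokens, tokens[1:])
    if cur.get("pos") == "DT" and nxt.get("pos") == "NN"
]
```

The comparison needs only one pass over adjacent tokens, which gives an idea of the cost an efficient implementation of the horizontal axes could aim for on token sequences.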

4.1.3 User-defined functions

XQuery allows the declaration of functions that can be used to create a library of query helpers, for example to find instances of a particular non-annotated concept (e.g., a passive sentence) that can be deduced from annotated features (e.g., part-of-speech tags, sentence boundaries). Functions are also the preferred way of extending XQuery without changing its syntax. For instance, stand-off support can be implemented using functions such as overlapping(a, b), which takes two sets of annotation elements, a and b, and returns all elements of a that overlap with some element of b (cf. AnnoLab [11]). A database may also allow external functions that are implemented outside the query environment, e.g., in C or Java. Such functions could be performance-optimised and make use of indexes. Depending on the product they may be more flexible than non-external functions, e.g., they may have access to the context node, or they may be subject to restrictions, e.g., restricted to primitive return values (numbers and strings).

For usage with annotated corpora, an XDBMS should support user-defined functions. External functions should also be implementable and offer at least the same possibilities as non-external functions. This allows the implementation of functions in a portable, pure-XQuery way, fulfilling expectation 1, as well as in a non-portable but performance-optimised external version. Note that using functions instead of extending the XQuery syntax can easily result in rather complex queries.
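The semantics described for overlapping(a, b) can be sketched as follows. This is not the AnnoLab implementation; it is a minimal Python illustration that assumes annotation elements are represented as hypothetical (start, end, label) tuples with an exclusive end offset.

```python
def overlapping(a_set, b_set):
    """Return every element of a_set whose span overlaps the span of some element of b_set."""
    return [
        a for a in a_set
        # Two half-open spans overlap if each starts before the other ends.
        if any(a[0] < b[1] and b[0] < a[1] for b in b_set)
    ]

# Hypothetical results from two annotation layers on the same signal.
phrases = [(0, 19, "NP"), (20, 43, "VP")]
nouns = [(16, 19, "NN")]

result = overlapping(phrases, nouns)
```

The sketch compares every element of the first set with every element of the second, so its cost grows with the product of the two set sizes. This is the kind of function that an external, index-backed implementation could speed up.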

4.2 Stand-off support

Adopting a stand-off approach solves the most prominent issues of document-oriented XML annotation models. Therefore support for stand-off annotations should be provided by an XDBMS or be implementable as user-defined functions. While stand-off support can be implemented using pure-XQuery functions, a performant implementation is likely to require a specialised index and will need to provide the query optimiser with the information necessary to devise a proper query plan. A database can also support stand-off annotations by adding new XPath axes (cf. section 5.2). While this approach allows formulating queries more naturally, it makes the queries incompatible with other XQuery implementations and thus violates expectation 1.

Footnote 2: Optional axes: ancestor(-or-self), following, following-sibling, preceding and preceding-sibling.

4.3 Index-based searching

Indexes are a central feature of any database. XML databases need to support a structural index which allows fast searches along any of the XPath axes. A database may also offer indexes on attribute values. Such indexes are especially helpful in an annotation context, because queries will almost always involve features. The database should provide a query optimiser which automatically makes use of the available indexes. External functions should be able to interact with the optimiser so that their execution can be properly planned. The indexing mechanism should be implemented in a modular way that allows new kinds of indexes to be incorporated, e.g. a temporal index for stand-off annotations or a full-text index. In order to prevent time- and space-consuming over-indexing, indexes may be configurable to index only certain elements or attributes.
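The idea of a configurable index on attribute values can be illustrated with a toy in-memory sketch in Python. It is not how any of the databases discussed here implement their indexes; the configuration set INDEXED and the sample document are hypothetical.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<s>"
    "<token pos='DT'>The</token><token pos='NN'>fox</token>"
    "<token pos='VBZ'>jumps</token><token pos='NN'>dog</token>"
    "</s>"
)

# Configuration: index only the pos attribute of token elements.
INDEXED = {("token", "pos")}

# Build the index once: (element name, attribute name, value) -> elements.
index = defaultdict(list)
for elem in doc.iter():
    for name, value in elem.attrib.items():
        if (elem.tag, name) in INDEXED:
            index[(elem.tag, name, value)].append(elem)

# A feature query becomes a single lookup instead of a scan over all elements.
nouns = [e.text for e in index[("token", "pos", "NN")]]
```

Restricting the index to the configured element and attribute pairs keeps it small, which is the point of making indexes configurable.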

4.4 Full-text search

XQuery as well as XPath 1 and 2 offer text searching support only via substring matches, mainly through the contains(a, b) function, which tests whether the string a contains the substring b, or through the matches(a, b) function, which matches a against the regular expression b. This approach is limited. Currently the W3C is working towards a recommendation for full-text search in XQuery and XPath 2 (XQFT) [35], which states three main differences between substring and full-text search:

● A full-text search searches for tokens and phrases rather than substrings, thus allowing for searches like find all instances of 'XML' and 'Query' with up to 3 tokens in between.

● A full-text search should support language-sensitive searches, e.g., using automatic stemming (a search for mouse also finds mice) or insensitivity to hyphenation (resign also matches re-sign).

● A full-text search is not exact. It therefore requires a notion of result scoring that allows results to be sorted by their relevance.
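The first difference, searching for tokens with a distance constraint, can be sketched in a few lines of Python. The sketch is not XQFT. It assumes simple whitespace tokenisation and case-insensitive matching, and the function name within_distance is hypothetical.

```python
def within_distance(tokens, first, second, max_between):
    """Positions where `first` is followed by `second` with at most `max_between` tokens in between."""
    lowered = [t.lower() for t in tokens]
    hits = []
    for i, tok in enumerate(lowered):
        if tok == first.lower():
            # The window covers the next max_between + 1 tokens after position i.
            window = lowered[i + 1 : i + 2 + max_between]
            if second.lower() in window:
                hits.append(i)
    return hits

# Hypothetical signal, tokenised on whitespace.
tokens = "XML is a powerful Query tool".split()

up_to_three = within_distance(tokens, "XML", "Query", 3)   # three tokens in between: match
up_to_two = within_distance(tokens, "XML", "Query", 2)     # window too small: no match
```

A substring test such as contains cannot express the distance constraint directly, because it has no notion of where one token ends and the next begins.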

Full-text search makes heavy use of tokens and may use other kinds of segmentation as well (phrases, paragraphs, etc.). While for a non-linguist user it is perfectly acceptable that the search engine implicitly segments texts, a linguist may want to decide explicitly where segments start and end as part of the annotation process. It is even possible that a corpus contains multiple different token annotations produced by different tokenisers. A full-text search optimally suited for a linguist user should support indexing based on user-defined segmentations. It should also be possible to select at query time which segmentations are used, which generalisations (stemming, synonyms, etc.) are applied and which resources (e.g. a thesaurus) are used. If a stand-off approach is used, the full-text index should be aware that the signal and the annotation are stored separately. Finally, it should be possible to mix structural constraints with textual constraints, e.g. in a bottom-up way, navigating from the signal to its associated attributes. Figure 9 illustrates this with a query in pseudo-XQFT syntax asking for sentences containing a determiner followed by the word 'fox' with at most one intervening token. The same query would be possible using traditional substring matching; however, it would be more complicated. One would also expect that the full-text index is used in the given example, while it would probably not be used by an equivalent traditional query.

Signal: 'The quick fox.'

    //sentence ftcontains( '.*'[@pos='DT'] ftand 'fox'
        with wildcards ordered distance at most 1 word)

Figure 9: Example of accessing annotations from tokens

5 Support

This section examines whether eXist [18] with the AnnoLab extensions [12], MonetDB/XQuery [4, 2, 19] and Galax/GalaTex [13, 9] can provide the facilities given in section 4. The products have been chosen because they implement XQuery, they are freely available and they support features like stand-off annotations and full-text search, though not necessarily at the same time. An overview of other XQuery implementations, their capabilities and how they fit in can be found in [22].

5.1 eXist/AnnoLab

The eXist XML database is a pure-Java product which offers a high compliance with XQuery, XPath 2.0 and XSLT. Version 1.2.4 supports XUpdate and the XMLDB API [37]; support for the XQuery Update Facility (XQUF) [36] and the XQuery API for Java (XQJ/JSR 225) [21] is under development. The horizontal axes and all other optional axes are supported. However, it should be noted that the query given in figure 8 runs very slowly on eXist. Apparently the step following::token[1] causes the full list of all tokens along the following axis to be collected, and only after that is the first item selected from the list and returned.

eXist provides configurable indexes that are used automatically if query-rewriting is enabled. At the time of writing, there is no query optimiser using index statistics to devise an efficient query plan. However, a modular indexing framework is present which allows the integration of new types of indexes, and work is under way on an API by which a future query optimiser can obtain index statistics.

A full-text search engine is available. eXist offers non-standard operators as well as functions that use the full-text index. Navigation from the signal into the XML structure is not possible (cf. figure 9). A stemmer based on Porter's algorithm [20] is available and can be enabled or disabled. A default tokeniser is provided and can be replaced with a custom implementation. Both features are set in a configuration file.

AnnoLab [12] provides additional functions for working with stand-off annotations. These functions make it possible to relate nodes to each other based on their stand-off anchors, e.g. by testing for overlap or containment. Usually, data from different annotations on the same signal is selected, e.g. from a topical field annotation as result A and a part-of-speech annotation as result B, and then the results are filtered, e.g. by testing whether there is overlap or containment between the results in A and B (see the function overlapping(a, b) in section 4.1.3). The functions are available in a pure-XQuery implementation and as optimised external functions implemented in Java (footnote 3). Currently there is no index supporting these functions.

5.2 MonetDB/XQuery

MonetDB is a relational database kernel which serves as a base for the PathFinder XQuery front-end [4]. The combination aims to be a high-performance XQuery engine. Updates are supported in compliance with XQUF (footnote 4). No results on the XQTS benchmark are available, thus there is no XQuery-compliance score given in table 1. However, results of the XMark benchmark are available on the project's homepage (footnote 5), as well as a comparison against other XDBMSs including Galax (section 5.3) and eXist (section 5.1).

XML documents that are stored in MonetDB/XQuery are shredded (decomposed into relational records that are stored using the relational kernel) and thus become indexed. PathFinder compiles XQuery statements into an optimised query plan which is then further optimised by the relational database core.

The PF/Tijah module [19] adds full-text search to MonetDB/XQuery, but does not conform to XQFT. The stemmer can be configured when first creating an index (footnote 6). Multiple indexes with different configurations can be maintained simultaneously. The full-text index is accessed through functions. Queries are formulated using the Narrowed Extended XPath I (NEXI) [24] language and passed to these functions as a string argument. NEXI is an XPath variant that is limited to the descendant step and adds the about(n, t1 ... ti) function, which is used to search for the terms t1 ... ti under the node n. Navigating from the signal into the XML structure is not possible.

MonetDB/XQuery comes with a stand-off extension module [2] which provides four additional axes: select-narrow, select-wide, reject-narrow and reject-wide. These can be used to select nodes based on their start and end attributes. Figure 10 shows an example taken from [3] which illustrates the use of these axes. The query Q1 selects all nodes that overlap with the span of B, while Q2 selects the nodes that contain the span of B.

    Q1: //B/select-wide::*          Result: A, B, C, E
    Q2: //*[./select-narrow::B]     Result: A, E

Figure 10: Stand-off annotation in MonetDB/XQuery

Footnote 3: At the time of writing, the pure-XQuery and the Java implementation are out of sync.

Footnote 4: At the time of writing complying with the July 2006 draft version of the recommendation.

Footnote 5: http://monetdb.cwi.nl/projects/monetdb/XQuery/Benchmark/index.html

Footnote 6: The manual states that the tokeniser and stop-words are configurable as well, but the author was not able to find the corresponding configuration options.

5.3 Galax/GalaTex

Galax [13] is an OCAML-based implementation of XQuery. It was originally intended as a reference implementation of XQuery and consequently implements most of the recommendation. Its functionality has grown beyond its original goals. Galax 1.1 supports XPath 2.0 and XQUF, and work has been done on XML Schema validation support and query optimisation. Galax optionally supports persistent indexes using the Jungle storage manager, which in turn uses Berkeley DB as a storage backend. The query engine includes a query rewriter which simplifies queries in order to prepare them for the algebraic query optimiser.

GalaTex [9] is an extension to the Galax XQuery engine. It is intended as a reference implementation of XQFT. Currently GalaTex is not performance optimised. The XQFT primitives are implemented as XQuery functions and the syntax extensions are internally transformed into a pure XQuery 1.0 statement, which in turn is handed over to the Galax query engine. A web-based interface to a demo database is available on the GalaTex homepage (footnote 7). Here one can experiment with the capabilities of the upcoming XQFT recommendation based on various pre-written queries or using custom queries.

It should be pointed out here that the query suggested in figure 9 is, unfortunately, not XQFT compliant. The recommendation does not support navigating from the signal into the XML structure and therefore may be inconvenient for querying linguistic annotations. However, the full-text search capabilities provided by GalaTex can still prove useful for corpus exploration involving document or corpus structure.

6 Conclusion

The choice of storing an XML-encoded corpus in an XML database comes with certain expectations. We have considered various issues complicating the processing of a corpus stored in an XML database. Taking these issues into account, we argued that it is reasonably possible to meet expectation 1, interoperability. However, meeting expectations 2 and 3, avoiding model transformation and naturally applying existing XML technology, is only possible if the corpus annotation format is extremely simple (cf. section 2.3). As soon as workarounds are introduced to sidestep the limited expressiveness of the XML model (see section 3.3), concessions need to be made, in particular with respect to the complexity of queries.

An XML database should be used for linguistically annotated corpora only after considering thoroughly which concessions one is willing to make with respect to the initial expectations. We find that XML databases can be used if the users are experts with XML technology, in particular with XQuery, or if considerable effort is made to simplify the querying process by providing a comfortable query front-end (cf. [7]). In such a case the MonetDB/XQuery database currently seems to offer the best compromise. It does not support XQFT and supports stand-off annotations through additional axes, making the queries not portable, but it supports stand-off annotations and full-text search using persistent indexes and a query optimiser. The AnnoLab/eXist approach using XQuery functions to support stand-off annotation should be considered if interoperability is an issue, as it aims to be more portable. Galax/GalaTex is interesting in order to evaluate how full-text search and XML data can interact.

Due to the limitations of XML discussed in this article, we still see a general demand in the linguistic community for a format for linguistically annotated corpora

● that is standardised and widely accepted and used and thus provides for multiple implementations and their interoperability;

● that captures all constructs used in linguistic annotation and exposes them in the associated encoding format, a query language, APIs, etc. and thus does not mandate a model transformation when exchanging data between different parties or when storing the data in an annotation database;

● that comes with a complete tool-set comparable to the tools existing for XML: parser, validator, transducer, query engine, data management system, etc.

Footnote 7: http://www.galaxquery.com/galatex/demo/galax demo.html

Acknowledgements

The author wants to thank Georg Rehm and Mônica Holtz for valuable discussions during the writing of this article.

Facility                                eXist/AnnoLab    MonetDB/XQuery   Galax/GalaTex
Version                                 0.5.1RC4/1.2.4   Jun 2008         1.1/0.5.1
XQuery support                          yes              yes              yes
– XQTS Minimal Conformance Score        99.4%                             99.4%
– Horizontal axes                       yes              yes              yes
– Persistent indexes                    yes              yes              optional (Jungle)
  — Automatically used                  yes              yes              yes
  — Configurable                        yes              no               no
Full-text query                         yes              yes              yes
– Flavour                               eXist native     PF/Tijah (NEXI)  XQFT
– Stemmer                               fixed (Porter)   configurable     fixed (Porter)
– Tokeniser                             configurable     configurable     fixed
– Stop-words                            configurable     configurable     fixed
– Persistent index                      yes              yes              no
– Multiple indexes                      no               yes              no
– Navigate from tokens into structure   no               no               no
XUpdate                                 yes              no               no
XQuery Update facility                  in preparation   yes (July 2006)  yes
Stand-off support                       yes (functions)  yes (axes)       no
– Persistent index                      no               yes
Query optimizer                         rewriting        algebraic        algebraic
– Optimized WHERE                       no               yes              yes
User-defined functions                  yes              yes              yes
– External functions                    yes (Java)       yes (C)          yes (OCAML)
– Module support                        yes              yes              yes

Table 1: Database facilities by product
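As an illustration only, the rows of Table 1 that concern stand-off annotation and full-text search can be transcribed into a small data structure and filtered by requirement. The following Python sketch is not part of the original study: the class `Product`, the function `shortlist` and all field names are hypothetical, and only a handful of rows are transcribed. The empty stand-off index cell for Galax/GalaTex is treated as "no", which is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Product:
    """A few rows of Table 1, transcribed for one product (hypothetical type)."""
    name: str
    fulltext_flavour: str              # row "Full-text query / Flavour"
    fulltext_persistent_index: bool    # row "Full-text query / Persistent index"
    standoff_mechanism: Optional[str]  # row "Stand-off support"; None means "no"
    standoff_persistent_index: bool    # row "Stand-off support / Persistent index"


PRODUCTS = [
    Product("eXist/AnnoLab", "eXist native", True, "functions", False),
    Product("MonetDB/XQuery", "PF/Tijah (NEXI)", True, "axes", True),
    # Table 1 leaves the stand-off index cell empty for Galax/GalaTex;
    # treating it as False is an assumption of this sketch.
    Product("Galax/GalaTex", "XQFT", False, None, False),
]


def shortlist(products, *, standoff=False, standoff_index=False,
              fulltext_index=False, xqft=False) -> List[str]:
    """Return the names of products that satisfy every requested facility."""
    result = []
    for p in products:
        if standoff and p.standoff_mechanism is None:
            continue  # no stand-off support at all
        if standoff_index and not p.standoff_persistent_index:
            continue  # stand-off queries are not backed by a persistent index
        if fulltext_index and not p.fulltext_persistent_index:
            continue  # full-text search is not backed by a persistent index
        if xqft and p.fulltext_flavour != "XQFT":
            continue  # full-text flavour is not XQuery Full Text
        result.append(p.name)
    return result


if __name__ == "__main__":
    # Indexed stand-off support versus standard full-text syntax.
    print(shortlist(PRODUCTS, standoff=True, standoff_index=True))
    print(shortlist(PRODUCTS, xqft=True))
```

Requiring indexed stand-off support leaves only MonetDB/XQuery, while requiring XQFT leaves only Galax/GalaTex, which mirrors the trade-off described in the conclusion: no single product in Table 1 satisfies both requirements at once.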

References

[1] Loredana Afanasiev and Maarten Marx. An analysis of XQuery benchmarks. Information Systems, 33(2):155–181, 2008.
[2] Wouter Alink, Raoul Bhoedjang, Arjen de Vries, and Peter Boncz. Efficient XQuery support for stand-off annotation. In Proceedings of the International Workshop on XQuery Implementation, Experience and Perspectives (XIME-P 2006), Chicago, IL, USA, June 2006.
[3] Wouter Alink, Valentin Jijkoun, David Ahn, Maarten de Rijke, Peter Boncz, and Arjen de Vries. Representing and Querying Multi-dimensional Markup for Question Answering. In Proceedings of the 5th Workshop on NLP and XML (NLPXML), in conjunction with EACL, Trento, Italy, April 2006.
[4] Peter Boncz, Torsten Grust, Maurice van Keulen, Stefan Manegold, Jan Rittinger, and Jens Teubner. MonetDB/XQuery: a fast XQuery processor powered by a relational engine. In SIGMOD '06: Proceedings of the 2006 ACM SIGMOD international conference on Management of data, pages 479–490, New York, NY, USA, 2006. ACM.
[5] Jean Carletta, Jonathan Kilgour, Tim O'Donnell, Stefan Evert, and Holger Voormann. The NITE Object Model Library for handling structural linguistic annotation on multimodal data sets. In Proceedings of the EACL Workshop on Language Technology and the Semantic Web (3rd Workshop on NLP and XML, NLPXML-2003), Budapest, Hungary, 2003.
[6] Noureddine Chatti, Sylvie Calabretto, and Jean-Marie Pinon. Encoding and querying multi-structured documents. In Bob Martens and Milena Dobreva, editors, Proceedings ELPUB2006 Conference on Electronic Publishing, pages 237–246, Bansko, Bulgaria, June 2006.
[7] Georg Rehm, Richard Eckart, and Christian Chiarcos. An OWL- and XQuery-based mechanism for the retrieval of linguistic patterns from XML-corpora. In Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2007, Borovets, Bulgaria, 2007.
[8] TEI Consortium. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Guidelines, TEI, November 2007. URL: http://www.tei-c.org/Guidelines/P5/.
[9] Emiran Curtmola, Sihem Amer-Yahia, Philip Brown, and Mary Fernández. GalaTex: a conformant implementation of the XQuery full-text language. In WWW '05: Special interest tracks and posters of the 14th international conference on World Wide Web, pages 1024–1025, New York, NY, USA, 2005. ACM.
[10] Stefanie Dipper, Michael Götze, Uwe Küssner, and Manfred Stede. Representing and querying standoff XML. In Georg Rehm, Andreas Witt, and Lothar Lemnitzer, editors, Data Structures for Linguistic Resources and Applications – Proceedings of the Biennial GLDV Conference 2007, pages 337–346, Tübingen, Germany, April 2007. Gunter Narr Verlag.
[11] Richard Eckart. Towards a modular data model for multi-layer annotated corpora. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 183–190, Sydney, Australia, July 2006. Association for Computational Linguistics.
[12] Richard Eckart and Elke Teich. An XML-based data model for flexible representation and query of linguistically interpreted corpora. In Georg Rehm, Andreas Witt, and Lothar Lemnitzer, editors, Data Structures for Linguistic Resources and Applications – Proceedings of the Biennial GLDV Conference 2007, pages 327–336, Tübingen, Germany, April 2007. Gunter Narr Verlag.
[13] Mary F. Fernández and Jérôme Siméon. Growing XQuery. In Luca Cardelli, editor, ECOOP, volume 2743 of Lecture Notes in Computer Science, pages 405–430. Springer, 2003.
[14] Georg Rehm, Richard Eckart, Christian Chiarcos, and Johannes Dellert. Ontology-based XQuery'ing of XML-encoded language resources on multiple annotation layers. In European Language Resources Association (ELRA), editor, Proceedings of the Sixth International Language Resources and Evaluation (LREC'08), Marrakech, Morocco, May 2008.
[15] Nancy Ide, Patrice Bonhomme, and Laurent Romary. XCES: An XML-based encoding standard for linguistic corpora. In Proceedings of the 2nd International Language Resources and Evaluation Conference, Paris, 2000. ELRA.
[16] Nancy Ide and Keith Suderman. GrAF: A graph-based format for linguistic annotations. In Proceedings of the Linguistic Annotation Workshop, pages 1–8, Prague, Czech Republic, June 2007. Association for Computational Linguistics.
[17] Esther König and Wolfgang Lezius. The TIGER Language - A Description Language for Syntax Graphs. University of Stuttgart, Institut für Maschinelle Sprachverarbeitung (IMS), April 2003.
[18] Wolfgang Meier. eXist: An open source native XML database. In Akmal B. Chaudhri, Mario Jeckle, Erhard Rahm, and Rainer Unland, editors, Web, Web-Services, and Database Systems, volume 2593 of Lecture Notes in Computer Science. Springer, 2003.
[19] Vojkan Mihajlovic. Score Region Algebra: A flexible framework for structured information retrieval. PhD thesis, University of Twente, Centre for Telematics and Information Technology, December 2006.
[20] Martin F. Porter. An algorithm for suffix stripping. Readings in information retrieval, pages 313–316, 1997.
[21] Java Community Process (JCP) program. JSR 225: XQuery API for Java (XQJ). Java Specification Requests (JSR) – proposed final draft, Java Community Process (JCP) program, November 2008. URL: http://jcp.org/en/jsr/detail?id=225.
[22] Liam Quin. Communicating Query: Where XQuery Fits and How to Implement It. In Extreme Markup Languages 2007®: Proceedings, Montréal, Québec, August 2007.
[23] Henry S. Thompson and David McKelvie. An hyperlink semantics for standoff markup of read-only documents. In Proceedings of SGML Europe 97, Barcelona, Spain, May 1997.
[24] Andrew Trotman and Börkur Sigurbjörnsson. Narrowed Extended XPath I (NEXI). Advances in XML Information Retrieval, pages 16–40, 2005.
[25] Dennis Tsichritzis and Anthony Klug. The ANSI/X3/SPARC DBMS framework report of the study group on database management systems. Information Systems, 3(3):173–191, 1978.
[26] W3C. XML Path Language (XPath) Version 1.0. W3C recommendation, W3C, November 1999. URL: http://www.w3.org/TR/1999/REC-xpath-19991116.
[27] W3C. XSL Transformations (XSLT). Version 1.0. W3C recommendation, W3C, November 1999. URL: http://www.w3.org/TR/1999/REC-xslt-19991116.
[28] W3C. XML Linking Language (XLink) Version 1.0. W3C recommendation, W3C, June 2001. URL: http://www.w3.org/TR/2001/REC-xlink-20010627/.
[29] W3C. XML Schema parts 0, 1 and 2. W3C recommendation, W3C, October 2001. URLs: http://www.w3.org/TR/.
[30] W3C. XML Information Set (Second Edition). W3C recommendation, W3C, February 2004. URL: http://www.w3.org/TR/xml-infoset/.
[31] W3C. Extensible Markup Language (XML) 1.0 (Fourth Edition). W3C recommendation, W3C, September 2006. URL: http://www.w3.org/TR/2006/REC-xml-20060816/.
[32] W3C. XML Query Test Suite 1.0.2, 2006. URL: http://www.w3.org/XML/Query/test-suite/.
[33] W3C. XML Path Language (XPath) Version 2.0. W3C recommendation, W3C, January 2007. URL: http://www.w3.org/TR/2007/REC-xpath20-20070123/.
[34] W3C. XQuery 1.0: An XML Query Language. W3C recommendation, W3C, January 2007. URL: http://www.w3.org/TR/2007/REC-xquery-20070123/.
[35] W3C. XQuery and XPath Full Text 1.0. W3C candidate recommendation, W3C, May 2008. URL: http://www.w3.org/TR/2008/CR-xpath-full-text-10-20080516/.
[36] W3C. XQuery Update Facility 1.0. W3C candidate recommendation, W3C, March 2008. URL: http://www.w3.org/TR/2008/CR-xquery-update-10-20080314/.
[37] XML:DB Initiative. XML:DB Database API. Working draft, XML:DB Initiative, September 2001. URL: http://xmldb-org.sourceforge.net/xapi/xapi-draft.html.
