Extracting and Converting Data from Semistructured Biological Databanks with SRS

Thierry Coupaye† ([email protected], contact author, fax: +33 (0)4 76 63 34 58)
Thure Etzold‡ ([email protected])

† IMAG-LSR, University of Grenoble, Actimart, Bat. 8, Avenue de Vignate, 38610 Gieres, France.
‡ EMBL Outstation, The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, U.K.
Abstract

One fundamental property underlies most biological databanks: their availability in text format. We propose an approach to retrieve and convert biological data stored in textual flat files into information in formats more suitable for further analysis and use by client applications. Extracted data can be exported into different data structures (CORBA objects, DBMS relations or objects, C structures, HTML reports). These data structures are loader objects. They are generated from loader specifications, which specify how to get specific pieces of information and how to package them. This article focuses on CORBA loaders. There is growing interest in CORBA in the Bioinformatics community, and more generally in Life Science Research, because CORBA is seen as a unifying framework that could help integrate heterogeneous data sources and the applications using these data.
Keywords Semistructured Textual Data, Data Extraction, Data Conversion, CORBA Wrappers, Bioinformatics.
The work presented here was done while the author was a postdoctoral fellow in the SRS group at the European Bioinformatics Institute (EMBL-EBI), Cambridge, U.K.
1 Introduction

Database Management Systems (DBMSs) are very powerful at managing large collections of inter-dependent information. They are able to model, store and handle consistently, efficiently and selectively large amounts of data accessed simultaneously by numerous users. The data dealt with is highly structured through the notion of schema. However, far from all electronic data are stored in DBMSs. A lot of data is continuously produced and published outside DBMSs on the World-Wide Web (web) or in other textual sources. Biological and geographical databanks, scientific experiment reports, and machine or network control reports are just some examples.

In the Molecular Biology field, textual data is the most common. All molecular biology databanks are available as text files and ASCII is the de facto standard for data exchange. Also, almost all applications in molecular biology (DNA or protein sequence alignment programs, etc.) take text as input and produce text as output.

Some of these data are completely unstructured (images, videos, raw texts). Some others are well structured and thus can be, or are, stored in conventional DBMSs. Others fall in the middle: they are almost structured, but they may be incomplete or irregular, and they may contain errors or redundancies. Also, they are implicitly structured, i.e., they are intrinsically structured but this structure is not explicit (there is no machine-readable data schema) and therefore not exposed to users or applications. Moreover, this structure may evolve rapidly. These data are referred to as semistructured data [Abi97]. These are the kind of data we are interested in.

We have developed a Loading System for dynamically retrieving and converting semistructured data stored in textual flat files into formats more suitable for further analysis and use. Our system allows users to query databanks, and to convert and make available the extracted information in various formats in a very flexible way.
The work presented here is part of the SRS project (see the W3 server at http://srs6.ebi.ac.uk). The SRS system [EA93a, EA93b, EUA96, CCKE98], developed since 1990 at the European Molecular Biology Laboratory (EMBL) and then at the European Bioinformatics Institute (EBI), is one of the most widely used systems in Bioinformatics and Molecular Biology to access biology-related information stored in textual flat files. At present, SRS envelopes approximately 250 databanks worldwide at 35 public sites in 24 countries. Databanks can be arbitrarily large. The EMBL databank [RTSCF96], for instance, contains about 3.3 million entries describing nucleotide (DNA, RNA) sequences. Around 10,000 accesses to SRS are made each day, making
it the most popular bioinformatics service at the EMBL-EBI ever.
Figure 1: Loader Generation for Data Extraction

The complete process of information extraction and conversion is referred to as loading. Figure 1 illustrates how it works:

- The Token Server is responsible for the information extraction. It uses a parsing and indexing mechanism to cross-reference entries of different databanks and provides as output tokens, i.e., named, separate pieces of information. For instance, a token seq would contain a string representing a biological sequence (DNA, RNA, protein). These pieces are separate but not isolated: the Token Server permits simultaneous access to different databanks and can create linked data, with tokens linked to other tokens from other databanks.

- The Loader Generator makes the information conversion possible. It generates data structures in different forms (CORBA objects, DBMS schemas, C types) and associated operations (or load files) to populate these data structures (i.e., load data) with the extracted information obtained dynamically from the Token Server. We call these data structures loader objects. They are central to our approach. Loader objects are generated by the Loader Generator from loader specifications, which express how to get the data from the Token Server and how to package it.

The Token Server (parser) and the Loader Generator both take inputs expressed in Icarus. Icarus is a general-purpose scripting language developed within the SRS project, used (1) to express the structure of the textual databanks, which can be seen as a schema used to express queries, (2) to express the annotated grammars used by the Token Server (parser) to extract data (and possibly transform it), and (3) to express the loader specifications used by the Loader Generator to convert and export the extracted data. Due to lack of space, we cannot describe Icarus in more depth here; interested readers may refer to [EUA96] (see also examples of Icarus grammars in the appendix).

The rest of this paper is organized as follows. Sections 2 and 3 detail how the information is extracted and converted. Section 4 describes our CORBA test case. Section 5 introduces some related work and underlines some distinctive features of our approach. Finally, Section 6 concludes this article.
2 Extracting Information

This section describes the data extraction mechanism in our system by showing an example involving two real biological databanks. We first introduce these two databanks to show what kind of data our system can deal with. We then describe the data extraction process itself and underline some characteristic features of our extraction process.
2.1 Semistructured Data
The two databanks we use as examples are ENZYME and PROSITEDOC. The ENZYME databank [Bai93] can be useful to anybody (biologists, chemists, etc.) working with enzymes and can be of help in the development of computer programs involved with the manipulation of metabolic pathways. The ENZYME databank is a sequence of entries. Each entry has different data fields: a name which is used as an identifier, an alternative name, a general description, a description of the catalytic activity of the enzyme, a comment and a link to the PROSITEDOC databank. Only the identifier is mandatory; other fields are optional. A typical entry would look like:

    ID   1.1.1.8
    DE   GLYCEROL-3-PHOSPHATE DEHYDROGENASE (NAD+).
    CA   SN-GLYCEROL 3-PHOSPHATE + NAD(+) = GLYCERONE PHOSPHATE + NADH.
    CC   -!- ALSO ACTS ON 1,2-PROPANEDIOL PHOSPHATE AND GLYCERONE
    CC       SULFATE (BUT WITH A MUCH LOWER AFFINITY).
    PR   PROSITE; PDOC00740;
    DR   Q00055, GPD1_YEAST; P41911, GPD2_YEAST; P34517, GPDA_CAEEL;
    DR   P52425, GPDA_CUPLA; P13706, GPDA_DROME; P07735, GPDA_DROVI;
    //
The PROSITEDOC databank is in fact a part of the PROSITE databank. PROSITE [BB94] is designed to help determine the function of uncharacterized proteins translated from genomic or DNA sequences. It consists of a database of biologically significant sites, patterns and profiles that help to reliably identify to which known family of proteins (if any) a new sequence belongs. PROSITEDOC entries annotate PROSITE entries and can also be seen as a separate databank. A PROSITEDOC entry has an identifier, a description field, an authors list and possibly links to PROSITE and ENZYME entries. A PROSITEDOC entry is shown below. Note that a PROSITEDOC entry is quite different from an ENZYME entry. ENZYME entries are separated by the symbol '//' and each data field is "tagged" by symbols like 'ID' for the identifier field, 'DE' for the description field, etc. The structure of PROSITEDOC entries is much less obvious. PROSITEDOC is much more natural-language-like. There are no tags to either separate entries or to introduce data fields:

    {PDOC00740}
    {PS00957; NAD_G3PDH}
    {BEGIN}
    **************************************************************
    * NAD-dependent glycerol-3-phosphate dehydrogenase signature *
    **************************************************************

    NAD-dependent glycerol-3-phosphate dehydrogenase (GPD) (EC 1.1.1.8)
    catalyzes the reversible reduction of dihydroxyacetone phosphate to
    glycerol-3-phosphate. It is a eukaryotic cytosolic homodimeric
    protein of about 40 Kd. As a signature pattern we selected a
    glycine-rich region that is probably involved in NAD-binding [1].

    -Consensus pattern: G-[AT]-[LIVM]-K-[DN]-[LIVM](2)-A-x-[GA]-x-G-
     [LIVMF]-x-[DE]-G-[LIVM]-x-[LIVMFYW]-G-x-N
    -Sequences known to belong to this class detected by the pattern: ALL.
    -Other sequence(s) detected in SWISS-PROT: NONE.
    -Last update: November 1997 / Pattern and text revised.

    [ 1] Otto J., Argos P., Rossmann M.G.
         Eur. J. Biochem. 109:325-330(1980).
ENZYME and PROSITE are protein-sequence-related databanks. The core databanks in molecular biology are related to nucleic acid (DNA and RNA) sequences and protein sequences. The first databanks appeared in the 1980s; their size has been estimated to double every one or two years in recent years, and this rate is predicted to continue into the next millennium. Today there are hundreds of such databanks available. Furthermore, a large amount of information related to these kinds of data is also stored in semistructured databanks. This includes bibliographic sources, 3D structures, chemical properties, genetic and physical genomic maps, taxonomic schemas, experiment listings, etc. We have mentioned these data repositories because they belong to our primary technical domain of expertise. It is very important, however, to underline that our approach can be transposed to domains other than biology. We are basically interested in all data that can be incomplete, redundant, irregular or even erroneous, and which is given in a textual format that often has an intrinsic structure, although this structure is not explicitly provided as such, for instance through the notion of schema in DBMSs.
2.2 The Data Extraction Process
The data extraction process is depicted in Figure 2. We continue with our running example and explain how data can be extracted from our two databanks.

Figure 2: The Data Extraction Process

As is often the case when dealing with textual data, our approach is based on a parsing and indexing mechanism. Data repositories, ENZYME and PROSITEDOC in our example, are described by context-free grammars which represent databanks as sequences of entries. Each entry is in turn composed of different data fields. The contents of the data fields are parsed, and selected words or tokens are isolated and inserted into an index. There is generally a separate index for each data field. Indexes are organized as B+ trees (B+ trees are an extension of B-trees in which actual values are stored in the leaves only). Interested readers may find in the appendix the grammars defining the ENZYME and PROSITEDOC databanks.

The actual parsing is done by the Token Server. The Token Server is based on Icarus (see http://srs.ebi.ac.uk/srs5/man/srsman.html), the language used to specify both the structure of the data and its syntax. In contrast to the LEX and YACC compilers which are widely used for parsing purposes, Icarus is an interpreter that combines lexical and syntactical definitions. Grammars not only allow the Token Server to "recognize" entries, they can also express actions to be carried out during the parsing process. To produce output, any terminal or non-terminal can be associated with one or more action commands: create a token, extend (add text to) existing tokens, set some global states of the parsing process, input/output directives, print commands, variable assignments and function calls.

Parsers traditionally "parse all they can" (by recursive descent in our case), i.e., they decompose the input data starting with the root production defined in the grammar and go on recursively until they have only terminals. We refer to this scheme as forced parsing, in opposition to lazy parsing, by which our Token Server parses only the production(s) that it is asked to parse, plus the production(s) that it needs in order to do so.
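The per-field indexing scheme can be illustrated with a simplified sketch (hypothetical Python; a sorted in-memory structure stands in for the real B+ trees, and the names are ours, not the actual SRS API):

```python
from bisect import bisect_left, insort

class FieldIndex:
    """One index per data field, standing in for a B+ tree:
    keys are kept ordered and values (entry identifiers) live
    only at the leaves."""
    def __init__(self):
        self.keys = []        # sorted token values
        self.entries = {}     # token value -> list of entry identifiers

    def insert(self, token, entry_id):
        if token not in self.entries:
            insort(self.keys, token)      # keep keys ordered
            self.entries[token] = []
        self.entries[token].append(entry_id)

    def lookup(self, token):
        i = bisect_left(self.keys, token)
        if i < len(self.keys) and self.keys[i] == token:
            return self.entries[token]
        return []

# Indexing two ENZYME entries by their 'id' field:
id_index = FieldIndex()
id_index.insert("1.1.1.8", "entry-42")
id_index.insert("1.1.1.1", "entry-7")
```

A query on the 'id' field then reduces to an ordered lookup rather than a scan of the flat file.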
In our example (see also the grammar given in the appendix), if one asked for the token 'id' from the databank PROSITEDOC, the parser would only go through the productions it needs, i.e., 'entry' and 'fields'. It would not parse the description field, the alternative name field, etc. Lazy parsing is probably the most distinguishing feature of our Token Server (together with random token access). It allows the Token Server to reply to client application requests in a very fine-grained manner by dynamically extracting specific pieces of data and providing them as tokens in run-time structures, while the actual data sources are kept unchanged: the data is always stored in textual files.
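The distinction between forced and lazy parsing can be sketched as follows (a hypothetical Python illustration, not the actual Icarus machinery; the toy grammar and production names are ours):

```python
import re

# A toy grammar for PROSITEDOC-like entries: the 'id' production
# lives inside 'entry', so 'entry' must be parsed before 'id'.
# The 'descr' production exists but need not be touched.
GRAMMAR = {
    "entry": re.compile(r"\{PDOC\d+\}"),
    "id":    re.compile(r"PDOC\d+"),
    "descr": re.compile(r"\*\s*([^*]+?)\s*\*"),
}

def lazy_parse(text, token_name, visited):
    """Parse only the productions needed to produce token_name,
    recording which productions were actually visited."""
    if token_name == "id":
        entry = lazy_parse(text, "entry", visited)   # dependency first
        visited.append("id")
        return GRAMMAR["id"].search(entry).group(0)
    visited.append(token_name)
    return GRAMMAR[token_name].search(text).group(0)

visited = []
token = lazy_parse("{PDOC00740} ** some title **", "id", visited)
# Only 'entry' and 'id' were parsed; 'descr' was never touched.
```

A forced parser would instead expand every production of the grammar before returning any token.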
3 Converting Information

Data conversion is the means by which data extracted by the Token Server is made available to client applications. The conversion process is depicted in Figure 3. The process is based on loader specifications. A loader specifies the mapping between entries from a databank and the data structures provided to client applications. The Loader Generator takes as input loader specifications and generates as output data structures and associated operations to populate these data structures.

Figure 3: The Data Conversion Process

The loader generation involves the generation of:

- static data structure definitions: database schemas, CORBA IDL schemas, C types;

- operations: load files, CORBA object implementations, C functions. These operations actually load the data by dynamically accessing tokens provided by the Token Server (and thus eventually extract data from the databanks).

In other words, the Loader Generator does not generate objects that actually contain the data, but rather data container patterns that know how to load the data from the Token Server when they are asked to do so.

We focus now on loader specifications, which specify how to package the extracted information. The main features of loader specifications are the following:
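The idea of a "data container pattern that knows how to load its data" can be sketched as follows (hypothetical Python; token_server is a stand-in for the real Token Server API, and the class is ours, not generated SRS code):

```python
class LoaderObject:
    """A generated container: it stores no data up front, but fetches
    each attribute from the token server the first time it is read."""
    def __init__(self, token_server, entry_id, token_map):
        self._server = token_server
        self._entry = entry_id
        self._tokens = token_map      # attribute name -> token name
        self._cache = {}

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, i.e. for
        # the loader attributes declared in token_map.
        if name in self._tokens:
            if name not in self._cache:
                self._cache[name] = self._server(self._entry,
                                                 self._tokens[name])
            return self._cache[name]
        raise AttributeError(name)

calls = []
def fake_token_server(entry_id, token):
    """Records each extraction request, simulating the Token Server."""
    calls.append(token)
    return f"<{token} of {entry_id}>"

obj = LoaderObject(fake_token_server, "1.1.1.8", {"Description": "i_des"})
```

Reading obj.Description triggers exactly one token extraction; subsequent reads are served from the cache, mirroring how loader objects defer all work to the Token Server.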
- Several loader specifications can be provided for one databank, providing different views on this databank. For each databank, it is possible to declare a loader as its default loader. Furthermore, the system provides an absolute default loader, the Basic Loader, which can be used to load any entry of any databank.

- Conversely, loaders can have input from several databanks: a general loader can be used for several related databanks (for example biological sequence databanks), providing a unified view of databanks that are "close" to one another.

- Loaders can inherit from one another. All loaders inherit from the absolute default loader, but one can also define more specific loaders by using inheritance.

- Loaders can define foreign attributes: a loader wraps one databank but it can get some information from one or several other linked databanks. It is possible to define whether the attribute is single- or multi-valued.

- Loaders can define composition links (relationships): a loader object can have a link (relation) to another loader object. This enables the creation of object graphs. It is again possible to define the cardinality of this link (single- or multi-valued).

- Loaders can compute data: a function can be associated with a loader attribute so that its value is computed (instead of being taken from the databank).

The language used for expressing loader specifications is Icarus [EUA96]. In fact, both loader specifications and meta-loader specifications (which express what can be done in a loader specification) are expressed in Icarus. A loader definition consists of different attributes such as the name of the loader, the list of databanks that the loader wraps, and the inheritance clause. Only the name of the loader is mandatory. This is followed by a list of loading attribute definitions, which is the central part of the loader definition. Each loading attribute can be seen as a pair in which:

- the attribute specification is the definition of the exported attribute. It consists of the name of the attribute, its type (string, int, real, objects, etc.) and its cardinality (single- or multi-valued);

- the loading specification defines how to get the data from the Token Server in order to make it available as an attribute value. It specifies mainly the token name and, optionally, the name of the databank or group of databanks (if several tokens with the same name exist for several databanks); whether a link operation has to be performed to obtain the entry containing the requested token (if the token comes from another databank); or an instruction (Icarus code) to compute the value of the attribute.

We give hereafter the definitions of two loaders. The first is not intended to be very realistic but to show most of the capabilities of loaders. The second is more realistic: it corresponds to the ENZYME databank we presented earlier.
    BioSeq_Class:$LoadClass:[BioSeq
      attrs:{
        # Input from multiple databanks
        $LoadAttr:[AccNumber type:string
          load:{
            $Tok:[acc from:@EMBL_DB]
            $Tok:[accno from:@SWISSPROT_DB]}]
        # The sequence attribute
        $LoadAttr:[Sequence type:string isSeq:@PROTSEQ_DATA]
        # Foreign attribute
        $LoadAttr:[MedlineCit type:string
          load:{$Tok:[cit link:medline]}]
        # Composition links
        $LoadAttr:[Taxa type:object card:multi class:@Taxonomy_Class]
        # Computed attribute
        $LoadAttr:[SeqType type:string
          load:{
            $Tok:[value:'DNA' from:@EMBL_DB]
            $Tok:[value:'Protein' from:@SWISSPROT_DB]}]
      }
    ]

    Enzyme_Class:$LoadClass:[Enzyme
      libs:@ENZYME
      attrs:{
        $LoadAttr:[AltName type:string load:$Tok:i_altnam]
        $LoadAttr:[Description type:string load:$Tok:i_des]
        $LoadAttr:[CatalycActivity type:string load:$Tok:i_catact]
        $LoadAttr:[Comments type:string card:multi load:$Tok:cc]
        $LoadAttr:[Prositedocs type:object card:multi
          link:prositedoc loader:@Prositedoc_Class]
      }
    ]
The Icarus object BioSeq_Class defines a loader called BioSeq. This loader specification defines five attributes: AccNumber, Sequence, MedlineCit, Taxa and SeqType. The loader may have input from multiple databanks: it can be used to load entries of the EMBL [RTSCF96] or SWISSPROT [BA97] databanks. If the actual entry comes from EMBL, the value of the attribute AccNumber (accession number) will be set by getting the token acc from the Token Server. If the entry comes from SWISSPROT, the token accno will be used. Then the attribute Sequence is defined. This attribute will eventually provide the actual biological sequence (isSeq is a special tag: it indicates that the value of the attribute has to be retrieved not by accessing a token table but by another mechanism, since sequences are treated separately from other attributes). The loader specification also defines a foreign attribute, MedlineCit. Medline is a databank of research literature in medicine and biology (cf. http://medline.cos.com/). The value of MedlineCit will be a collection of bibliographic citations related to the sequence. Citations are obtained by getting the token cit from Medline using a link to the Medline databank. The attribute Taxa defines a composition link: its value will be a set of entries from the Taxonomy databank as defined by the loader object Taxonomy_Class. Finally, SeqType is a computed attribute. No token is extracted from the databanks. Instead, the value 'DNA' is assigned if the entry comes from EMBL, which is a databank of DNA sequences, or 'Protein' if the entry comes from SWISSPROT, which is a databank of protein sequences.

The object Enzyme_Class defines a loader for the ENZYME databank. We do not detail its definition, for it works exactly the same way as the BioSeq loader and is much simpler.
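The behaviour of multi-databank and computed attributes in a loader specification like BioSeq can be sketched as follows (hypothetical Python; the dictionary encoding of the specification and the token values are ours, for illustration only):

```python
# A toy model of a loader specification: each attribute maps each
# source databank either to a token to fetch, or to a fixed value
# (a computed attribute, as for SeqType above).
BIOSEQ_SPEC = {
    "AccNumber": {"EMBL": ("token", "acc"),
                  "SWISSPROT": ("token", "accno")},
    "SeqType":   {"EMBL": ("value", "DNA"),
                  "SWISSPROT": ("value", "Protein")},
}

def load_attribute(spec, attr, source, get_token):
    """Resolve one loader attribute for an entry of databank 'source'."""
    kind, arg = spec[attr][source]
    if kind == "token":
        return get_token(arg)   # fetch from the token server
    return arg                   # computed attribute: fixed value

# Illustrative token values for one entry of each databank:
tokens = {"acc": "X56734", "accno": "P12345"}
acc = load_attribute(BIOSEQ_SPEC, "AccNumber", "EMBL", tokens.get)
seq_type = load_attribute(BIOSEQ_SPEC, "SeqType", "SWISSPROT", tokens.get)
```

The same attribute name thus resolves to different tokens, or to a constant, depending on the databank the entry actually comes from.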
4 The CORBA Pilot Application

In this section we focus on the generation of CORBA loaders, which is the application of our approach that we have investigated the most so far. CORBA (Common Object Request Broker Architecture) is a unifying framework that allows distributed software components running on heterogeneous computing systems, operating systems and languages to interoperate. It is thus considered a good candidate for the integration of computational biology data and applications.

The SRS system is built on the Token Server and the Loader Generator. It enables interconnecting and querying, through the SRS Query Language, of semistructured flat-file databanks. Unlike SQL or OQL (in fact the algebras they are based on), the SRS Query Language has no join operator but two link operators denoted by '<' and '>': if A and B are sets of entries of two linked databanks, the query 'A > B' will return all entries in B that are linked to one or more entries in A. A Web interface provides remote access to data but returns only textual information, while a C-API provides structured data (C structures, at a very low level) but cannot be accessed remotely. Some applications however need both (a) remote access to SRS servers and (b) objects (instead of text) so that they can achieve complex treatments that cannot be done within SRS (such as visualizing data). This can be achieved by using SRS Object Servers (SRSOS). SRS Object Servers are CORBA servers that can be accessed remotely by applications (clients) through an Object Request Broker (ORB) [OMG95a]. An SRS Object Server provides access to an SRS server and must thus reside on the same computer as this SRS server, as well as a CORBA ORB of course (we used ORBacus from Object Oriented Concepts Inc. for our experiments). SRS Object Servers exhibit two kinds of objects:

- Loader Objects actually contain the information. They extract data from SRS and convert it according to loader specifications.

- Common Objects provide the "glue" between loader objects. They make SRS Object Servers self-contained by providing services that give access to the data contained in loader objects.
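The semantics of the link operators can be sketched with sets (hypothetical Python; the cross-references shown are illustrative, not real databank content):

```python
# Cross-references between two databanks, as (entry_in_A, entry_in_B)
# pairs -- e.g. ENZYME entries pointing to PROSITEDOC entries.
LINKS = {("EC 1.1.1.8", "PDOC00740"),
         ("EC 2.7.7.7", "PDOC00123")}

def link_right(a_entries, links):
    """'A > B': all entries of B linked to one or more entries of A."""
    return {b for (a, b) in links if a in a_entries}

def link_left(b_entries, links):
    """'A < B': all entries of A linked to one or more entries of B."""
    return {a for (a, b) in links if b in b_entries}

hits = link_right({"EC 1.1.1.8"}, LINKS)
```

Unlike a relational join, each operator is unidirectional: it selects entries on one side of the link only.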
4.1 Loader Objects
The CORBA objects that actually contain data and that are accessible to client applications, the loader objects, are instances of an interface (class) hierarchy. The root of the hierarchy is the interface Loader, referred to as the Basic Loader (the absolute default loader). The hierarchical organization allows the use of the same loader for entries of groups of related databanks, and the definition of more and more specific loaders.
The Basic Loader

The basic loader Loader is the most generic loader. It is very simple: it does not provide a lot of information, but it can be used to load entries of any databank. The Loader interface IDL definition is the following:

    interface Loader {
      readonly attribute string source;
      readonly attribute string loader;
      readonly attribute string ID;
      Collection link(in string target_source,
                      in string loader_name);
    };
The attribute source represents the actual source databank from which the entry has been loaded using the loader loader. ID is the entry identifier. Identifiers can be used to compare a loader object with an SRS entry retrieved in another way (command line, Web interface or C-API). IDs can also be used to compare objects: two objects may be instances of different loader interfaces but actually represent the same databank entry loaded with two different loaders. Loader objects exhibit a linking capability: the operation link retrieves entries from a databank target_source that are linked with the current entry in the current databank. Note that the two databanks need not be linked directly, as ENZYME and PROSITEDOC are in our example (there is an explicit link to PROSITEDOC in ENZYME). An SRS system can be seen as a network of databanks. SRS is able to link any databank to any other databank (except databanks that are not connected in any way to any other databank) by computing the shortest path between the two databanks.
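Finding such a shortest chain of link operations is a shortest-path search over the databank network, which can be sketched with a breadth-first search (hypothetical Python; the network topology below is invented for illustration and does not reflect the actual SRS configuration):

```python
from collections import deque

# The databank network as an undirected adjacency list (illustrative).
NETWORK = {
    "ENZYME":     ["PROSITEDOC", "SWISSPROT"],
    "PROSITEDOC": ["ENZYME", "PROSITE"],
    "PROSITE":    ["PROSITEDOC", "SWISSPROT"],
    "SWISSPROT":  ["ENZYME", "PROSITE", "EMBL"],
    "EMBL":       ["SWISSPROT"],
}

def shortest_link_path(network, source, target):
    """Breadth-first search: the first path reaching the target is a
    shortest chain of link operations; returns None if unconnected."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in network.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

path = shortest_link_path(NETWORK, "PROSITEDOC", "EMBL")
```

Each hop in the returned path corresponds to one link operation that SRS composes to relate entries of databanks that share no direct cross-reference.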
CORBA Mapping and Loader Generation

In an SRS Object Server, most objects (the loader objects) are in fact generated by the Loader Generator from loader specifications, as explained in Section 3. They belong to interfaces that inherit (directly or indirectly) from the Loader interface. They have particular attributes and methods that match the attributes defined in the loader specifications according to a mapping we do not expose here completely. Very roughly, "local" loader attributes (i.e., attributes whose corresponding tokens come from the wrapped databank) are mapped into CORBA attributes, except sequence attributes, which receive special treatment: they are mapped into CORBA operations in order to have a parameter format which defines in which format the sequence should be provided (many databanks and applications require their own format for sequences). Foreign and composition attributes are mapped into operations as well. Finally, some other operations provide a means to launch applications (sequence alignments, sequence similarity searches, etc.) on entries. These operations can be generated because the Loader Generator knows which operation can be launched on entries of each databank by using meta-information provided by SRS. For the databank ENZYME, for instance, using the loader Enzyme defined in Section 3, the Loader Generator would generate the following IDL interface:

    interface Enzyme : Loader {
      readonly attribute string AltName;
      readonly attribute string Description;
      readonly attribute string CatalycActivity;
      readonly attribute stringSeq Comments;
      PrositedocSeq Prositedocs()
        raises (InternalCommFailure);
    };
But the Loader Generator does not stop here. It generates not only the IDL definition but also the actual C++ implementation of the interface. This implementation provides, for instance, a very important method called load. This method, whose signature is void load(ENTRY entry), actually loads the databank entry entry (ENTRY is a type defined by the Token Server) using the loader Enzyme, so that the attributes defined in the loader may be accessed through the CORBA attributes AltName, Description, CatalycActivity and Comments. The Loader Generator also generates the code of the method Prositedocs, which returns the collection of entries from the databank PROSITEDOC linked to the current ENZYME entry. It is the code of such methods that contains all the runtime communications with the Token Server to dynamically extract the data from the databank. In one of our test servers, providing 7 loaders for 15 databanks, about 150 lines of IDL and 3,000 lines of C++ are produced. These figures should only be taken to give an idea of the amount of code produced: of course, this code depends greatly on the number and complexity of the loaders. It should also be noted that the generated C++ code is quite repetitive and not very pleasant to write, since it mainly consists of low-level C-API calls to the Token Server. This shows that even for small servers, it is much faster and more user-friendly to generate the loaders than to hand-code them. It is also much easier to update the loaders when changes in databank formats occur.
4.2 SRS Object Servers
Loader objects contain the data extracted from the databanks by SRS, but they are not standalone: other objects and services are needed to access and query them. These objects and services are referred to as Common Objects because they are generic and are the same for all SRS Object Servers. Common objects and loader objects together form SRS Object Servers. The CORBA front-ends of SRS Object Servers are instances of the interface SRSOS. These objects are registered with the Naming Service [OMG95b]. Clients can get a reference to these objects through the Naming Service and send them requests to obtain the four services described hereafter. The SRSOS interface is defined as follows:

    interface SRSOS {
      QueryEvaluator get_QueryEvaluator();
      DatabankFactory get_DatabankFactory(in string Name);
      LoaderFactory get_LoaderFactory();
      MetaDescriptor get_MetaDescriptor();
    };
Querying Databanks

QueryEvaluator objects submit and remove queries on SRS databanks. They represent the most common use of SRS Object Servers. The interface closely matches the specification of the OMG (Object Management Group [OMG95b], Query Service chapter):

    interface QueryEvaluator {
      readonly attribute QLTypeSeq types;
      readonly attribute QLType default_ql_type;
      Collection evaluate(in string query,
                          in QLType ql_type,
                          in string queryName,
                          in ConversionSchemeSeq conversions);
      short remove(in string queryName);
    };
The attributes types and default_ql_type describe the available query languages and the default query language, respectively. For the time being, only the SRS Query Language1 is available, but work is in progress to implement (subsets of) OQL. The operation evaluate submits the query to the SRS system, gets the set of entries satisfying the query, builds and finally returns the corresponding collection of loader objects. While waiting for ORB vendors to provide Collection Services, SRS Object Servers provide a simple generic Collection interface that facilitates the manipulation of objects: addition of an element, suppression of an element, iteration over elements, retrieval of an element by its position (if the collection is sorted), etc. The returned objects are built according to the conversion schemes given in the parameter conversions. A conversion scheme is described by an IDL structure that associates a loader with a databank (it specifies which loader has to be used for each databank involved in the query):

    struct ConversionScheme {
      string DatabankName;
      string LoaderName;
    };
    typedef sequence<ConversionScheme> ConversionSchemeSeq;
Parameters ql_type, queryName and conversions are optional. During the evaluation, if a conversion scheme is not specified for a given databank, the operation evaluate uses the default conversion scheme (default loader). If this default loader does not exist at all for a given databank, evaluate uses the basic loader (called Loader). Query names are useful for combining query results (logical operators, etc.). In SRS, it is not possible to define a query with a query name that already exists. The operation remove deletes a query so that its name can be used again later.

1 The SRS Query Language is based on the relational algebra except that, as mentioned earlier, it does not have a join operator but two unidirectional link operators '>' and '<'.
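The loader-selection fallback used by evaluate can be sketched as follows (hypothetical Python; the function and the example loader tables are ours, modelling the three-level choice described above):

```python
def pick_loader(databank, conversions, default_loaders):
    """Choose the loader for a databank: an explicit conversion scheme
    wins, then the databank's declared default loader, then the
    absolute default (the basic loader, called 'Loader')."""
    if databank in conversions:
        return conversions[databank]
    if databank in default_loaders:
        return default_loaders[databank]
    return "Loader"

# Illustrative configuration: an explicit scheme for ENZYME, a
# declared default for EMBL, and nothing at all for MEDLINE.
conversions = {"ENZYME": "Enzyme"}
defaults = {"EMBL": "BioSeq"}
```

Every entry of every databank can therefore always be wrapped, since the basic loader accepts any entry.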