XEMBL: distributing EMBL data in XML format

BIOINFORMATICS APPLICATIONS NOTE

Vol. 18 no. 8 2002 Pages 1147–1148

Lichun Wang, Jean-Jack Riethoven and Alan Robinson

European Bioinformatics Institute, EMBL Outstation – Hinxton, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK

Received on January 9, 2002; revised on March 5, 2002; accepted on March 13, 2002

ABSTRACT

Summary: Data in the EMBL Nucleotide Sequence Database is traditionally available in a flat file format that has a number of known shortcomings. With XML rapidly emerging as a standard data exchange format that can address some problems of flat file formats by defining data structure and syntax, there is now a demand to distribute EMBL data in an XML format. XEMBL is a service tool that employs CORBA servers to access EMBL data, and distributes the data in XML format via a number of mechanisms.

Availability: Use of the XEMBL service is free of charge at http://www.ebi.ac.uk/xembl/, and can be accessed via web forms, CGI, and a SOAP-enabled service.

Contact: [email protected]; [email protected]; [email protected]

Supplementary information: Information on the EMBL Nucleotide Sequence Database is available at http://www.ebi.ac.uk/embl/. The EMBL Object Model is available at http://corba.ebi.ac.uk/models/. Information on the EMBL CORBA servers is at http://corba.ebi.ac.uk/.

The EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl/) hosted at the EMBL’s European Bioinformatics Institute (EBI) is a comprehensive database of DNA and RNA sequences and is managed and maintained using the relational DBMS Oracle. It contains over 130 tables and 140 relationships, having around 15 billion nucleotide bases stored in over 14 million objects of primary data and comprising millions of sub-objects called features. Traditionally, the sequences and related information are made available freely in flat file format via FTP, CD-ROM, WWW tools, etc. Queries through tools such as SRS (Sequence Retrieval System, http://srs.ebi.ac.uk/) (Etzold et al., 1996; Lion) also return data in flat file format. However, flat files have a number of shortcomings: the format may not be described formally; it is difficult to represent complex data and relationships; the meaningful units of information (‘objects’) are not represented or handled well; it is hard to retrieve objects separately; assembly of objects into bigger aggregates is difficult; elaborate parsing is often required; etc. (Wang et al., 2000).

© Oxford University Press 2002

In general, the currently available resources lack the flexible environment needed to meet an individual researcher’s needs today. The Object Management Group’s (OMG) Common Object Request Broker Architecture (CORBA) (http://www.omg.org/corba/) and the World Wide Web Consortium’s (W3C) Extensible Markup Language (XML) (http://www.w3.org/xml/) are emerging as two mechanisms of standardization for distributed computing and web applications, and have attracted extensive attention in the biological field (Wang et al., 2000; Achard et al., 2001). To address the problems of the EMBL flat file format, the EMBL CORBA infrastructure has been developed at EMBL-EBI (Wang et al., 2000) to make EMBL data available in terms of ‘objects’. It not only provides an efficient means of accessing and distributing EMBL data, but also offers a flexible environment in which users can develop their own applications (for example, sequence analysis or data mining) by building clients to the CORBA servers. However, whilst CORBA operates efficiently on local area networks, it is often blocked by firewalls when used over the Internet. In addition, CORBA is often perceived as complex to use. For these reasons, interest in XML technology has increased. Although XML’s data modelling capabilities are more restrictive than those of CORBA and IDL, it defines a data structure and format for documents that can address some problems of flat file formats. XML can be read by both humans and computers. It can be distributed over the Internet using SOAP (Simple Object Access Protocol), which runs over the all-pervasive HTTP and can therefore pass through firewalls, and it can furthermore be offered as a web service through application programming interfaces defined using WSDL (Web Services Description Language). XML is starting to serve as a standard data exchange format between developers and designers.
Hence the Interoperable Informatics Infrastructure Consortium (I3C) is proposing an XML subgroup for bioinformatics (http://www.i3c.org/). XEMBL is a project carried out at EMBL-EBI which investigates exploiting XML with CORBA to improve the distribution of EMBL data. It provides end users with a service tool that distributes EMBL data in XML format (http://www.ebi.ac.uk/xembl/). The XEMBL system structure is shown in Figure 1. The EMBL data is described by the EMBL Object Model, which classifies EMBL data into five packages of information:

• Sequence Information that represents biological sequences, general information about these sequences and administrative data associated with database entries;

• Feature Information that describes detailed sequence annotations (known as sequence features);

• Location Information that offers locations on sequences;

• Reference Information that provides bibliographic references that hold information about the sequences;

• Taxonomy Information that gives the taxonomy of the organisms from which the sequences were obtained.

[Figure 1 not reproduced. Labelled components: End User; Web, Perl Client and Java Client; XEMBL; Stub, ORB and Skeleton; EMBL Server, Ref Server and Taxo Server with their Objects; Persistence, JDBC and Oracle Driver; EMBL DB.]

Fig. 1. XEMBL system structure
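As a rough illustration of how the five packages listed above relate to one another, the following Python sketch models them as plain data classes. All class and field names, and the accession `TEST0001`, are hypothetical and chosen for illustration only; they are not taken from the EMBL Object Model, whose actual definition is at the URL given in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Location:
    """Location Information: a span on a sequence (1-based, inclusive)."""
    start: int
    end: int


@dataclass
class Feature:
    """Feature Information: one annotation attached to a location."""
    key: str                       # e.g. "CDS" or "source"
    location: Location
    qualifiers: Dict[str, str] = field(default_factory=dict)


@dataclass
class Reference:
    """Reference Information: a bibliographic citation."""
    authors: List[str]
    title: str
    journal: str = ""


@dataclass
class Taxon:
    """Taxonomy Information: the source organism and its lineage."""
    scientific_name: str
    lineage: List[str] = field(default_factory=list)


@dataclass
class SequenceEntry:
    """Sequence Information, linking to the other four packages."""
    accession: str                 # hypothetical value in this sketch
    description: str
    sequence: str
    features: List[Feature] = field(default_factory=list)
    references: List[Reference] = field(default_factory=list)
    taxon: Optional[Taxon] = None


def feature_sequence(entry: SequenceEntry, feature: Feature) -> str:
    """Return the bases covered by a feature on the forward strand."""
    loc = feature.location
    return entry.sequence[loc.start - 1:loc.end]


# A toy entry with made-up data.
cds = Feature("CDS", Location(3, 6), {"product": "toy protein"})
entry = SequenceEntry(
    accession="TEST0001",
    description="toy entry",
    sequence="ACGTACGTAC",
    features=[cds],
    taxon=Taxon("Homo sapiens", ["Eukaryota", "Metazoa"]),
)
cds_bases = feature_sequence(entry, cds)
```

Treating each feature, reference and taxon as a separate object is what allows objects to be retrieved individually or assembled into larger aggregates, which the text identifies as difficult with flat files.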

The detailed description of the data and their relationships can be found at http://corba.ebi.ac.uk/models/. The information is served by one of three CORBA servers: the EMBL server, the Reference server and the Taxonomy server (Wang et al., 2000). The EMBL server accesses all the main information on nucleotide sequences, whilst the Reference server and Taxonomy server are independent servers for Reference Information and Taxonomy Information.

When an end user submits a request for an accession number, XEMBL, which is actually a CORBA client, accesses the CORBA servers to get information from the EMBL database in the form of objects. It then maps the objects into an XML document complying with the selected DTD (Document Type Definition), which defines the data structure and XML format, and returns it to the user. For programmatic access to the EMBL data in XML format, clients may connect to the XEMBL SOAP server, making use of a simple interface defined in WSDL. The SOAP server is itself a client to the EMBL CORBA server; it marshals the data from CORBA objects into the XML format and returns it to the client.

A number of DTDs are being proposed as publicly available standard XML formats for exchanging biological data, such as Bioinformatic Sequence Markup Language (BSML) from Labbook, Inc. (http://www.labbook.com/), Architecture for Genomic Annotation, Visualization and Exchange (AGAVE) from former DoubleTwist, Inc., Genome Annotation Markup Elements (GAME) from Drosophila Genome Project/Celera (http://www.bioxml.org/dtds/), and BIOpolymer Markup Language (BIOML) from Proteometrics (http://www.bioml.com/BIOML/). Starting from these specifications, the I3C is striving to reach a consensus on an XML DTD for sequence data. Currently, XEMBL supports the BSML and AGAVE XML formats. In the future, XEMBL will support more of the available formats, e.g. GAME and BIOML.

With EMBL data in a specific XML format, e.g. BSML, the user can use the Genomic XML Viewer from Labbook (freely downloadable) to view or extract data. Users can also write their own XML parser to extract the data relevant to their further research needs. This application shows that exploiting XML with CORBA can improve the distribution of EMBL data by distributing the EMBL objects in XML format via a number of mechanisms. It also shows that CORBA together with XML provides an ideal solution for accessing and distributing biological data to meet individual researchers’ needs.
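The two steps described above, mapping objects into an XML document and then extracting data with a user-written parser, can be sketched in a few lines of Python using the standard library's `xml.etree.ElementTree`. This is a minimal sketch under stated assumptions: the element and attribute names (`entry`, `feature`, `sequence`, and so on) are invented for illustration and do not follow the BSML or AGAVE DTDs, the entry is a made-up dictionary rather than a CORBA object, and no call to the XEMBL service is made.

```python
import xml.etree.ElementTree as ET


def entry_to_xml(entry: dict) -> str:
    """Serialise a toy entry (a plain dict) to an XML string.

    The element names are hypothetical, not those of BSML or AGAVE.
    """
    root = ET.Element("entry", accession=entry["accession"])
    ET.SubElement(root, "description").text = entry["description"]
    ET.SubElement(root, "organism").text = entry["organism"]
    for feat in entry["features"]:
        ET.SubElement(
            root,
            "feature",
            key=feat["key"],
            start=str(feat["start"]),
            end=str(feat["end"]),
        )
    ET.SubElement(root, "sequence").text = entry["sequence"]
    return ET.tostring(root, encoding="unicode")


def extract_features(xml_text: str, key: str) -> list:
    """Return (start, end, bases) for every feature with the given key."""
    root = ET.fromstring(xml_text)
    bases = root.findtext("sequence", default="")
    hits = []
    for feat in root.iter("feature"):
        if feat.get("key") == key:
            start, end = int(feat.get("start")), int(feat.get("end"))
            # Feature coordinates are treated as 1-based and inclusive.
            hits.append((start, end, bases[start - 1:end]))
    return hits


# Made-up data standing in for objects obtained from a server.
toy_entry = {
    "accession": "TEST0001",
    "description": "toy entry",
    "organism": "Homo sapiens",
    "sequence": "ACGTACGTAC",
    "features": [
        {"key": "source", "start": 1, "end": 10},
        {"key": "CDS", "start": 3, "end": 6},
    ],
}

xml_doc = entry_to_xml(toy_entry)
cds_hits = extract_features(xml_doc, "CDS")
```

Because the structure is carried by named elements rather than by column positions or line prefixes, the extraction step needs only a generic XML parser and a handful of lookups.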


ACKNOWLEDGEMENTS

The authors would like to thank Philip Lijnzaad, Patricia Rodriguez-Tomé and Nicole Redaschi for their significant contributions to the CORBA servers and the EMBL Object Model.

REFERENCES

Achard,F., Vaysseix,G. and Barillot,E. (2001) XML, bioinformatics and data integration. Bioinformatics, 17, 115–125.

Etzold,T., Ulyanov,A. and Argos,P. (1996) SRS: information retrieval system for molecular biology data banks. Methods Enzymol., 266, 114–128.

Labbook, Inc., BSML DTD and Genomic XML Viewer. http://www.labbook.com/.

Lion Technology Inc., http://www.lionbioscience.com/.

Wang,L., Rodriguez-Tomé,P., Redaschi,N., McNeil,P., Robinson,A.J. and Lijnzaad,P. (2000) Accessing and distributing EMBL data using CORBA (common object request broker architecture). Genome Biol., 1, 0010.1–0010.10.