BioXSD
An XML Schema for sequence data, features, alignments, and identifiers

Matúš Kalaš (1,2), Edita Karosiene (3), László Kaján (4), Sveinung Gundersen (5), Jon Ison (6), Pål Puntervoll (1), Christophe Blanchet (7), Kristoffer Rapacki (3) and Inge Jonassen (1,2)

(1) Computational Biology Unit, Uni Computing and (2) Department of Informatics, University of Bergen, Bergen, Norway; (3) Center for Biological Sequence Analysis, Technical University of Denmark, Kongens Lyngby, Denmark; (4) Bioinformatics and Computational Biology Department, Technische Universität München, Garching, Germany; (5) Institute for Cancer Research, Oslo University Hospital, Oslo, Norway; (6) European Bioinformatics Institute, EMBL, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK; (7) Institut de Biologie et Chimie des Protéines, CNRS and Université Claude Bernard Lyon 1, Lyon, France.
BACKGROUND
The EMBRACE project (EU FP6, 2005-2010) explored ways of providing smoothly interoperable bioinformatics tools in the form of Web services. It initiated the development of BioXSD.

ADVANTAGES OF XML WITH XSD
Textual and tabular formats, XML, and RDF each have their own advantages in certain usage scenarios.
The table shows certain advantages of XSD-based XML formats over textual formats (* … and those over schema-less RDF). Its two columns are "Advantages for users of tools" and "Advantages for providers of tools".
Advantages listed:
Automatic input validation (*)
Auto-generation of objects and GUIs (*)
Efficient compression (with EXI, a W3C standard) (*)
Semantic annotation of the format's details (with SAWSDL)
Benefit categories named in the table: usability, security, maintainability, scalability, semantics.
An XSD (i.e. XML Schema) defines data objects, just as object-oriented programming languages do. In particular with Web services, an XSD is mandatory and useful.
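The parallel between schema types and program objects can be sketched in Python. This is only an illustration under stated assumptions: the element names `sequenceRecord`, `name`, and `sequence` are hypothetical and are not taken from the actual BioXSD schema, and the class is written by hand here, whereas a data-binding tool would typically generate such a class from the XSD.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET

# Hypothetical record layout for illustration; NOT the actual BioXSD schema.
EXAMPLE_XML = """
<sequenceRecord>
  <name>example protein</name>
  <sequence>LCLYTHIGRNIYYGSYLY</sequence>
</sequenceRecord>
"""


@dataclass
class SequenceRecord:
    """Object counterpart of the hypothetical <sequenceRecord> element."""
    name: str
    sequence: str


def parse_record(xml_text: str) -> SequenceRecord:
    """Map the XML element onto the object, rejecting unexpected structure."""
    root = ET.fromstring(xml_text.strip())
    if root.tag != "sequenceRecord":
        raise ValueError(f"unexpected root element: {root.tag}")
    name = root.findtext("name")
    sequence = root.findtext("sequence")
    if name is None or sequence is None:
        raise ValueError("record must contain <name> and <sequence>")
    return SequenceRecord(name=name.strip(), sequence=sequence.strip())


record = parse_record(EXAMPLE_XML)
```

The structural checks in `parse_record` are a small manual stand-in for what a validating parser would derive from the schema itself.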
MIX-AND-MATCHING OF TOOLS
Standard parsing
Easier conversion of formats
Workflow programming easier & faster
Ready-made I/O building blocks: development easier & faster (*)
Benefit categories: maintainability, less effort.
EXAMPLE WORKFLOW
Without a common format, communication between diverse tools demands proprietary parsing, transformations, “shims”, and their maintenance in the future. Using common data formats makes workflow construction and maintenance easier and faster.
The two scenarios show what is required to connect two tools (such as Web services) that use: a) different formats; b) a common format ("Smooth!").
[Figure: workflow diagrams for a) Different formats and b) Common format. Blue rectangles are Web-service calls, red ovals are data.]
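The difference between the two scenarios can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the two tools, their function names, and the tab-separated and FASTA-like formats are hypothetical stand-ins, and plain function calls stand in for Web-service calls.

```python
from dataclasses import dataclass


# Scenario a) Different formats: the tools cannot be connected directly.
def tool_a_tabular() -> str:
    """Hypothetical tool A, emitting its own format: name<TAB>sequence."""
    return "seq1\tLCLYTHIG"


def tool_b_needs_fasta(fasta: str) -> int:
    """Hypothetical tool B, accepting only FASTA-like text; returns the length."""
    header, *lines = fasta.strip().splitlines()
    if not header.startswith(">"):
        raise ValueError("expected a FASTA header line")
    return len("".join(lines))


def shim_tabular_to_fasta(row: str) -> str:
    """The "shim": hand-written conversion that has to be maintained."""
    name, sequence = row.split("\t")
    return f">{name}\n{sequence}"


# Scenario b) Common format: both tools share one record type.
@dataclass
class Record:
    name: str
    sequence: str


def tool_a_common() -> Record:
    return Record("seq1", "LCLYTHIG")


def tool_b_common(record: Record) -> int:
    return len(record.sequence)


length_with_shim = tool_b_needs_fasta(shim_tabular_to_fasta(tool_a_tabular()))
length_common = tool_b_common(tool_a_common())
```

In scenario a) every new pair of formats needs its own shim, while in scenario b) the output of one tool is passed to the next unchanged.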
EXAMPLE DATA: BioXSD Sequence record
[Figure: type diagram]
Basic example:
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATA FMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFA FHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLL LLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLAL FLSIVILGLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTII GQMASILYFSIILAFLPIAGXIENY