BioXSD
The XML Schema for basic bioinformatics data
Matúš Kalaš (1,2), Pål Puntervoll (1), Edita Karosiene (3), Christophe Blanchet (4), Sveinung Gundersen (5), Jon Ison (6), Kristoffer Rapacki (3) and Inge Jonassen (1,2)
(1) Computational Biology Unit, Uni Computing and (2) Department of Informatics, University of Bergen, Bergen, Norway; (3) Center for Biological Sequence Analysis, Technical University of Denmark, Kongens Lyngby, Denmark; (4) Institut de Biologie et Chimie des Protéines, CNRS and Université Claude Bernard Lyon 1, Lyon, France; (5) Institute for Cancer Research, Oslo University Hospital, Oslo, Norway; (6) European Bioinformatics Institute, EMBL, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK.
[email protected]
STRATEGY
The EMBRACE project (EU FP6) has recommended a way of providing smoothly interoperable bioinformatics tools.
Textual and tabular formats, XML, and RDF each have their advantages in certain usage scenarios.
ADVANTAGES OF XSD
The table shows certain advantages of XSD-based XML formats over textual formats. Advantages marked with * also hold over RDF.
Advantages for users of tools:
Automatic input validation *
An XSD (i.e. an XML Schema) defines data objects, much as object-oriented programming languages do. With Web services in particular, an XSD is mandatory and useful.
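The parallel between an XSD type and a class can be sketched as follows. This is a minimal illustration only: the schema fragment, the element names `sequenceRecord`, `name` and `sequence`, and the helper functions are hypothetical and are not taken from the actual BioXSD schema. The sketch reads the element declarations from the schema and shows that they line up with the fields of a hand-written Python class.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, fields

XS = {"xs": "http://www.w3.org/2001/XMLSchema"}

# Hypothetical XSD fragment (illustrative only, NOT the real BioXSD schema):
# a record type with two required string fields.
EXAMPLE_XSD = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="sequenceRecord">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="name" type="xs:string"/>
        <xs:element name="sequence" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

# A matching XML instance (the sequence is an arbitrary short fragment).
EXAMPLE_XML = (
    "<sequenceRecord>"
    "<name>example</name>"
    "<sequence>LCLYTHIGRN</sequence>"
    "</sequenceRecord>"
)


@dataclass
class SequenceRecord:
    """Object-oriented counterpart of the hypothetical XSD type above."""
    name: str
    sequence: str


def schema_field_names(xsd_text: str) -> list:
    """Return the child element names declared inside xs:sequence."""
    root = ET.fromstring(xsd_text)
    declared = root.findall(".//xs:sequence/xs:element", XS)
    return [element.get("name") for element in declared]


def parse_record(xml_text: str) -> SequenceRecord:
    """Build a SequenceRecord from an XML instance, checking required parts."""
    root = ET.fromstring(xml_text)
    if root.tag != "sequenceRecord":
        raise ValueError("unexpected root element: " + root.tag)
    name = root.findtext("name")
    sequence = root.findtext("sequence")
    if name is None or sequence is None:
        raise ValueError("missing required child element")
    return SequenceRecord(name=name, sequence=sequence)


record = parse_record(EXAMPLE_XML)
class_field_names = [f.name for f in fields(SequenceRecord)]
```

Here the class is written by hand and the checks are coded manually. With a schema-aware toolkit, the class and the validation would typically be derived from the XSD itself.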
Advantages for providers of tools:
Easier conversion of formats
Parsing “for free”
Auto-generation of objects and GUIs *
Efficient compression (with the EXI standard by W3C) *
Semantic annotation of a type’s details (with SAWSDL)
Workflow programming easier and faster
Ready-made I/O building blocks: development easier and faster (*)
Categories of advantages marked in the table: usability, security, maintainability, scalability, semantics, resources.
MIX-AND-MATCHING OF TOOLS
EXAMPLE WORKFLOW
Without a common format, communication between diverse tools demands proprietary parsing, transformations, and “shims”. Using common data formats makes workflow construction and maintenance easier and faster. The two scenarios show what is needed to connect two tools (such as Web services) that use:
1. Proprietary formats
2. Common format (smooth!)
In the workflow diagrams, blue rectangles are Web-service calls and red ovals are data.
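The difference between the two scenarios can be sketched in a few lines. Everything here is hypothetical: the two tools, their text formats, and the `SequenceRecord` class (a simple stand-in for a shared XML format) are invented for illustration. In scenario 1 a hand-written shim has to translate one tool's output into the other tool's input; in scenario 2 the tools connect directly.

```python
from dataclasses import dataclass


@dataclass
class SequenceRecord:
    """Stand-in for a common data format shared by both tools."""
    name: str
    sequence: str


# Scenario 1: each tool has its own proprietary text format.

def tool_a_proprietary() -> str:
    """Hypothetical tool A: emits 'name|SEQUENCE'."""
    return "example|LCLYTHIGRN"


def shim_a_to_b(text: str) -> str:
    """Shim that rewrites tool A's output into the format tool B expects."""
    name, sequence = text.split("|", 1)
    return ">" + name + "\n" + sequence


def tool_b_proprietary(text: str) -> int:
    """Hypothetical tool B: expects '>name' on line 1 and the sequence on
    line 2, and returns the sequence length."""
    header, sequence = text.split("\n", 1)
    if not header.startswith(">"):
        raise ValueError("tool B cannot parse its input")
    return len(sequence)


# Scenario 2: both tools exchange the common format, so no shim is needed.

def tool_a_common() -> SequenceRecord:
    return SequenceRecord(name="example", sequence="LCLYTHIGRN")


def tool_b_common(record: SequenceRecord) -> int:
    return len(record.sequence)


length_with_shim = tool_b_proprietary(shim_a_to_b(tool_a_proprietary()))
length_common = tool_b_common(tool_a_common())
```

Both pipelines compute the same result, but only the first one needs extra conversion code, and that code must be written and maintained for every pair of tools that are connected.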
EXAMPLE DATA: BioXSD Sequence record
Type diagram: [figure]
Basic example:
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATA FMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFA FHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLL LLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLAL FLSIVILGLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTII GQMASILYFSIILAFLPIAGXIENY