Journal of Structural Biology 125, 97–102 (1999) Article ID jsbi.1999.4103, available online at http://www.idealibrary.com on
The BioImage Database Project: Organizing Multidimensional Biological Images in an Object-Relational Database J. M. Carazo* and Ernst H. K. Stelzer† *Centro Nacional de Biotecnologı´a-CSIC, Campus Universidad Autonoma, E-28049 Madrid, Spain; and †Light Microscopy Group, Cell Biology and Cell Biophysics Programme, European Molecular Biology Laboratory (EMBL), Meyerhofstrasse 1, D-69117 Heidelberg, Germany Received November 5, 1998, and in revised form February 4, 1999
developing databases or infrastructures. They organize (usually vast amounts of) information, keep track of the deposited information, efficiently locate desired information, and retrieve selected subsets of information. Using the computer to collect data and to make it available through networks is an excellent example of the influence of new technologies. Paper files, libraries, collections of videotapes, and computer files containing text, spreadsheets, and programs are replaceable by documents that seamlessly include all of the above. Behind all these efforts is the successful abstraction of the various kinds of documents using object-oriented database technologies. Accessing these through Web-oriented interfaces makes handling of and access to large collections of complex data sets a manageable activity.
The BioImage Database Project collects and structures multidimensional data sets recorded by various microscopic techniques relevant to modern life sciences. It provides, as precisely as possible, the circumstances in which the sample was prepared and the data were recorded. It grants access to the actual data and maintains links between related data sets. In order to promote the interdisciplinary approach of modern science, it offers a large set of key words, which covers essentially all aspects of microscopy. Nonspecialists can, therefore, access and retrieve significant information recorded and submitted by specialists in other areas. A key issue of the undertaking is to exploit the available technology and to provide a well-defined yet flexible structure for dealing with data. Its pivotal element is, therefore, a modern object relational database that structures the metadata and ameliorates the provision of a complete service. The BioImage database can be accessed through the Internet.
DESIGN PHILOSOPHY
Images recorded with the various microscopic techniques have properties that are common to all techniques and others that are specific to only some. A certain collection of properties of an image (e.g., specimen, technique, instrument, and processing steps) must be known if the image (i.e., the result of a scientific experiment) is to be useful for scientific purposes. However, not all methods apply to all images; e.g., confocal microscopes do not record tilt series and no methods to handle such data are required. Object-oriented databases, therefore, seem the most obvious choice to store the image data and to handle their properties and methods. Images simply become objects that have particular attributes and are subject to manipulation by specific methods. The object-orientation also helps in designing the interfaces to the database in such a manner that they are easily integrated in today’s Weboriented approaches. Even very complex databases can be easily handled and new information can be distributed rapidly.
r 1999 Academic Press
INTRODUCTION
Science is team-oriented. The members of a team have their specialties; they perform their experiments and communicate their results to collaborators and to all members of the scientific community (usually by means of publication). Important for each team are the information, the means of its distribution, and the facilities that synchronize the efforts of all team members. The existence of infrastructures that keep small as well as large sets of information organized and accessible is crucial, independent of how important this is for any efficient approach to solving a problem. Information needs to be compiled in readily accessible pieces. The more individuals contribute, search for, and read the data, the more important becomes its organization. This is a key idea behind 97
1047-8477/99 $30.00 Copyright r 1999 by Academic Press All rights of reproduction in any form reserved.
98
CARAZO AND STELZER
An important aspect is that BioImage addresses scientists with different backgrounds. This means they tend to use certain technical terms and to rely on assumptions, which might be misunderstood in other circumstances. In consequence, the larger the number of topics of research in which a data set is offered, the more information must be provided. HISTORICAL ASPECTS
Many databases that provide information relevant to the life sciences are now known. However, none were designed to offer the volumes obtained by three-dimensional electron microscopy (EM) or the series of optical sections acquired by confocal light microscopy. Despite the fact that key biological information has been produced by a large number of different types of microscopes for many years, this information was neither organized nor became generally easily accessed. This situation was clearly unsatisfactory, and the task of creating a new database containing and describing multidimensional images follows a request by those in the global scientific community that produce and use these data. Summarizing the experience gained from the many systems now available, the path along which databases have evolved involves four major steps: (1) New biological information can be produced due to the development of an innovative technology. (2) A number of laboratories have access to this new technology and develop an internal archiving standard for their own results. (3) The scientific community realizes the importance of the information stored in hidden archives. (4) Initiatives are launched to organize this information into databases and to provide an infrastructure to gain access to these databases as a service to the global scientific community. The BioImage Database Project aims at structuring and collecting multidimensional data sets recorded with the various microscopical techniques relevant to biology. Therefore, a number of exploratory studies were carried out between 1993 and 1996. One centered on macromolecular structures and outlined some of the general organizational principles applicable to complex image data (Marabini et al., 1996). Another development was the implementation of databases that describe both a confocal microscope (e.g., the Carl Zeiss LSM510) and the data recorded with it (Salmon et al., 1999). In late 1996, a major initiative was launched that included scientific laboratories in Madrid, Heidelberg, Barcelona, Basel, and Oxford, industrial partners such as Silicon Graphics and Informix, and the major biological database provider in Europe, the European Bioinformatics Institute (EBI) based in Hinxton. With the help of a set of associated laboratories they designed a new database, BioImage, which
is now available to the global scientific community (Carazo et al., 1998, 1999). MULTIDIMENSIONAL BIOLOGICAL IMAGES
In engineering terms, multidimensional images are defined as a set of signals that are a function of several dimensions. Objects of the BioImage Database Project are mainly three-dimensional images with three spatial axes x, y, and z (commonly referred to as ‘‘volumes’’). But the term ‘‘multidimensional’’ also refers to two- and three-dimensional images as a function of time (videos), as well as twodimensional images that directly convey a fractional dimension, such as the topography obtained by atomic force microscopy (AFM) or the surface relief obtained by scanning electron microscopy (SEM). The project, however, is not restricted to these few techniques, and an attempt has been made to include a large range of techniques (Lindek et al., 1999). In the simplest of these cases, the digital information is presented picture element by picture element, line by line, etc. The actual value of each sample has a functional relationship to some property of the object; e.g., in three-dimensional EM sample values are related to the Coulomb potential distribution of the specimen. In confocal fluorescence microscopy one records the three-dimensional fluorophore density distribution, which is related to an antibody, lipid, or antigen density distribution. Using vital labeling techniques such as GFP fusion proteins, one can easily record three-dimensional GFP density maps as a function of time. A graphical representation of multidimensional images (in particular volumes) is presented in Fig. 1. CONTENT OF BIOIMAGE
Although the core of BioImage is formed by relevant multidimensional biological images, a reference to the location of the image is only one of many of its entries. To be useful, every image must be properly identified and described. This descriptive information is referred to as metadata. In fact, most of the resources have been spent to identify and to structure the properties that are relevant when describing the circumstances in which a multidimensional image was obtained. Great care was taken to identify the properties that are shared by the different methods, trying hard to make the interdisciplinary use of the database as simple as possible. These properties include the sample preparation procedures, the condition of the instrument, the data processing steps, and the affiliations of the people involved. Therefore, every multidimensional image in BioImage is accompanied by a description of the biological specimen (e.g., taxonomy, macromolecular organization) as well as an account of the experimental details involved in the sample preparation, obser-
BIOIMAGE DATABASE PROJECT
FIG. 1. Graphical representation of a three-dimensional digital image (a volume). (a) A cube formed of subcubes (voxels) in an xyz coordinate system. The voxels represent the sampled values of the changing grayscale pattern. (b) Planes through the cube shown in (a) are grayscale images. They show the structure of the FMDV-SD6 viral complex at medium resolution. It is apparent that, at this level of resolution, the information cannot be modeled as strings of atomic coordinates. (c) The multidimensional information can be visualized and studied by a large number of techniques. One of the possible ways to represent three-dimensional continuous density data is a simplified surface rendering obtained from the data presented in (b). It should be realized that the structural data coming from experimentation are a digital volume such as that shown in (a) and that this is the actual multidimensional information in BioImage.
vation, and subsequent data processing. Every entry in BioImage is a combination of data (multidimensional images) and metadata (structured descriptors). An important part of the metadata is the links between pieces of information in BioImage and in other well-established databases (Lindek et al., 1999). BioImage, therefore, pays special attention to the development of these links, which are used whenever possible. Some of them are common to all databases, such as bibliographic and taxonomic links, while others are more specific to certain scientific areas (e.g., PDB (Abola et al., 1997), SWISS-PROT (Bairoch and Apweiler, 1998), and EMBL Nucleotide Sequence Database (Stoesser et al., 1998)). Most likely future versions of BioImage will include even more links to specific databases (e.g., set up by manufacturers of instruments, dyes, reagents, and methods), as they become available. In any case, an important issue, which requires an excellent idea of what is going on in related areas, is not to repeat a collection that is already prepared elsewhere. BIOIMAGE ORGANIZATION
BioImage is a heterogeneous database. It contains many different data types and the data come from
99
different sources. Nevertheless, BioImage is implemented as a single database for all multidimensional images and, at present, is capable of handling data recorded with atomic force, electron, and light microscopes. It handles descriptors for macromolecular complexes and cells as well as the taxonomy of the donor animals or plants. The metadata superset is very rich, but still consumes very little space in comparison to the size of even the simplest multidimensional image. Obviously, some entries contain information more related to the structural biology of macromolecules, while others contain information more relevant to cell biology. It soon became clear that the large number of entries that can be handled by BioImage was confusing for potential data submitters. A support of appropriate data submission procedures was in fact the key motivation behind establishing two database servers, one at CNB in Madrid and the other at EMBL in Heidelberg. Madrid handles the submissions of macromolecular data, while Heidelberg handles all submissions related to light microscopy, macroscopic data, and the less commonly used imaging techniques (e.g., NMR, MR, acoustic, X-ray microscopy). The two database servers have exactly the same data model implemented in exactly the same way on the same platforms. Also, the user’s view of the database at the two sites, i.e., the interfaces for query and visualization, is identical. The differences appear during the submission process, which is tailored to specific communities of users, during the curation process, since the required technical and biological expertise is different, and at the technical level, since data sets are stored only on the server where they were submitted. The scientific teams at the two server sites collaborate with a number of scientific laboratories as well as industrial partners. The laboratory in Madrid provides its expertise in three-dimensional EM. The laboratory in Basel is recognized for its achievements using EM as well as AFM. Barcelona excels with its experience in X-ray diffraction and its combination with EM in the work on viral structures. The group at EMBL in Heidelberg is a wellknown, established developer of confocal and video microscopes and is backed up by numerous applications in cell biology. Oxford provides the experience for handling video sequences. The industrial partner Silicon Graphics provides its knowledge in handling and visualizing multidimensional data, while Informix has the expertise in the difficult and emerging field of designing and implementing large distributed object-oriented databases. Finally, the advice of EBI, an EMBL outstation, guarantees that BioImage is developed in coordination with the major biological databases.
100
CARAZO AND STELZER
BioImage complements this organization by a network of associated laboratories (test users) that check the submission and query interfaces and provides the additional expertise that the main partners in the project lack. The advantages of incorporating these laboratories have been felt during two key tasks. During the design phase we took advantage of their broad perspective to make the implementation general and to close in on user-friendliness. During the data population phase they were the first to deposit data in BioImage and to test its interfaces. INTERFACES TO BIOIMAGE
BioImage has been designed from the onset to be accessible through the Internet. This is consistent with mainstream technology and makes it relatively simple to query, download, and visualize single images and complete three-dimensional data sets. Any browser suffices to access BioImage. The users will first interact with BioImage through its query interface (Lindek et al., 1999). It enables the user to perform searches using the metadata keys, which describe the multidimensional image data. Examples are the name of the specimen, the major technique, and the names of the authors. A more advanced interface in which these standard queries will be complemented by queries for structural content of multidimensional data is currently under development. As a result of this query the user receives a list of data sets in BioImage, along with a brief description and a thumbnail image. Selecting any of these entries, one enters a series of pages that present the metadata, links to related entries in other databases, and links to the data. The data can then be downloaded and manipulated within the Web environment using, e.g., the graphical tools developed within this project and presented in another article of this issue (Pittet et al., 1999). The data can also be directly retrieved from the server for further studies in the user’s laboratory. Data producers who want to submit their data to BioImage use the submission interfaces. They are designed to drive the submitter into an environment in which a relatively small number of key questions are asked. All metadata are acquired in this way. The actual data, i.e., the multidimensional images, are either deposited at a defined FTP site by the user or fetched by the database whenever it is considered opportune. CONTRIBUTING TO BIOIMAGE
The first implementation of BioImage was presented to the scientific community in September 1998 (Carazo et al., 1998). For data submissions, BioImage is now seeking contributions of multidi-
mensional images. Submitters are suggested to access directly the data server pertinent to their type of work, since the submission interfaces as well as the curation process are different at the two sites. In any case a peer-reviewed publication is expected to scientifically back up the submitted data. Data can also be submitted and kept ‘‘on hold’’ until publication. Besides the general call for data producers to submit to BioImage, the project contemplates three special areas of interest in which a fast data population is expected through an active scheme of direct contact with authors: (1) viral structures (coordinated by Stephen Fuller, contact Ignaci Fita at mailto:
[email protected]), (2) membrane proteins (coordinated by Andreas Engel, contact Bernard Heymann at mailto:
[email protected]), and (3) the cytoskeleton at the light microscopy level (coordinated by Ernst Stelzer, contact mailto:
[email protected]). PERSPECTIVES OF THE BIOIMAGE DATABASE PROJECT
The importance of any newly developed infrastructure is evaluated in terms of its impact. The potential of BioImage as the first implementation of a database for biological volume data should cover many interests ranging from the population of the database to its impact on other infrastructures. The access to and the correlation of complex data create the scientific capabilities. The authors expect BioImage to become a reference database, well accepted by the global scientific community and well populated by old and new, but always relevant, data. Through the combination of data at different levels of resolution, we also expect to have an impact on the increasingly important possibility revealed by the combination of the structural biology of macromolecules and the analysis of dynamic events. Access to and manipulation of data, as well as internal organization, reflect the close relationship between BioImage and the scientific community. Even more ambitious is the goal of starting to exploit the complementary information provided by the different sources of multidimensional images that BioImage presents through a consistent description of the data. Another great challenge is to establish close links between BioImage and the manufacturers of microscopes and image processing software systems. The instrument that records the data as well as the software that processes it stores a substantial fraction of the metadata BioImage asks for during the submission process. Standardization and the additional developments that automate the transfer of information have been realized in only a few cases, hence making this a technological challenge for
101
BIOIMAGE DATABASE PROJECT
future developments. Furthermore, it is desirable that visualization tools make full use of the information provided by BioImage. Different magnifications along the axes, annotations provided by other users, supplementary videos, still images, and lists of objects are all part of the metadata and new software developments can easily make use of them. Another important aspect is the use of a priori knowledge during a reconstruction process, which could make use of information provided by other techniques and earlier analyses in an ideally automated fashion. COPYRIGHT AND OTHER LEGAL CONCERNS
Serious problems of all databases include the access policy and the copyright of data. Who owns the data that are stored in the database? Who is allowed to access it? Does he/she have to pay for the access? Who is allowed to sell the data? What are the data retrievers allowed to do with the data? These are important questions that, in a world where huge amounts of data are easily moved to some anonymous place in fractions of a second, and with people whose only product is information, must be answered in a satisfactory manner. The questions are crucial and should by no means be taken lightly. These issues are, however, not specific to BioImage: they concern all databases. The increasing number of databases is forcing legislators to specifically address these basic legal questions. The images resulting from scientific experiments and the metadata describing those images are information packages that are provided by individual scientists typically working within an institutional or company framework. The authors and their affiliates share the ownership of the data. By submitting them to the BioImage database they assign a copyright of these images to BioImage for the sole purpose of making them available, while retaining the right to use them for all other purposes themselves. Such a copyright transfer is similar to what currently occurs when images are submitted for publication in a journal. In addition to a journal submission, BioImage creates an added value by its structure for the metadata. A dialectic way to overcome the copyright issue is to split the BioImage Project into several parts. The BioAgency contains the metadata and pointers to the data, which may be stored in either private or public data collections. The metadata are property of BioImage. BioTools provides the software to query, access, and retrieve the data and supports instrument and software manufacturers interfacing to BioAgency. In such a set-up the data are always owned and stored by whoever has the rights, while BioAgency is a means to offer the data to the global scientific community in a well-defined manner.
Should a data producer lack the infrastructure to make his/her data available, BioArchive stores the data and offers them under appropriate conditions. So while the data themselves are definitely the property of the institution or scientists who produced them, the metadata are the property of BioArchive for the purpose of queries and references to the data. Once a clear legal framework concerning intellectual property rights has been established, the rules governing exceptions for purposes such as education and scientific research, public health, administration, and jurisdiction will be established. These issues are of key importance in an academic and research environment, where we would like to imagine a future in which sequence data, atomic resolution data, confocal and video data, and many other types of information are used in conjunction with BioImage for the international advancement of science. FUTURE
Another important benefit of standardization and organization is stability, an excellent reason to attract private investment. Our current belief is that BioImage is particularly attractive to those in two other industries. On one hand publishers get the advantage of directly accessing a well-compiled and well-organized database of images that permanently refers to their journals. On the other hand application software companies, attracted by the fact that standards for data descriptions and organization are emerging in this community, are provided with a basis for developing specific application software. The BioImage model is based on professional design tools and implemented using a high-end commercially available database. It provides everything that is required for a huge number of entries yet still remains manageable and adaptable to future requirements. BioImage and related names are, therefore, registered names and patents relating to the database have been applied for. CONTACT
BioImage can be accessed through the Internet. It is up to its users to explore the opportunities for unraveling new relationships among data generated by a wide range of techniques. Potential users and those who wish to contribute multidimensional images are encouraged to contact either the Madrid server (for macromolecular structures) at http:// www.bioimage.org or the Heidelberg server (for data recorded at the light microscopic level) http://wwwembl.bioimage.org. Comments and suggestions are welcome in the open e-mail list provided at mailto:
[email protected].
102
CARAZO AND STELZER
The European Union, through Grant PL 960472, is the primary financer of this project. Special actions or extensions are being made possible by core national grants (BIO95-0768, BIO97-1485CE, and BIO98-0761, to J.M.C.). Initial support from the European Molecular Biology Network and many vivid discussions and proof-of-concept works in the field of macromolecular structure performed in collaboration with Dr. Joachim Frank (Albany, NY) are acknowledged by J.M.C. J.M.C. also thanks the participants of the 1993, 1995, and 1997 Gordon Conferences on ThreeDimensional Electron Microscopy for their support and ideas during all these years. We acknowledge the support and the contributions from the BioImage Test Users (A. Baker, J. BereiterHahn, A. Brisson, S. Castel, E. Dimmeler, K. H. Fuchs, S. Fuller, H. Gross, R. Hegerl, I. Karl, W. Keegstra, S. Marco, M. Miles, R. Schindler, P. Shaw, H. J. Tanke, J. M. Valpuesta, S. Vilaro, R. Wade, and J. Walz). E.H.K.St. acknowledges N. Salmon for his work on databases for the CCC and the CZJ LSM510, which preceded the developments of the BioImage database. We thank Elizabeth Hewat for granting BioImage access to the Electron Microscopy data of the FMDV-SD6 complex with which Figs. 1b and 1c were prepared. REFERENCES Abola, E. E., Sussman, J. L., Prilusky, J., and Mannin, N. O. (1997) Protein data bank archives of three-dimensional macromolecular structures, Methods Enzymol. 277, 556–571. Bairoch, A., and Apweiler, R. (1998) The SWISS-PROT protein sequence data bank and its supplement TrEMBL, Nucleic Acids Res. 26, 38–42.
Carazo, J. M., Chagoyen, M., Lucini, J. L., Stelzer, E., Lindek, S., Fritsch, R., Engel, A., Heyman, J. B., Fita, I., Kalko, S., Henn, C., Pittet, J. J., McNeil, P., Rodrı´guez-Tome´, P., Shotton, D., Boudier, T., and Machtynger, J. (1998) BioImage, the data base of volume images: Update on its first prototype, Int. Cong. Electron Microsc. (ICEM 14) 1, 197–198. Carazo, J. M., Stelzer, E., Engel, A., Fita, I., Henn, C., Machtynger, J., McNeil, P., Shotton, D. M., Chagoyen, M., de Alarco´n, P. A., Fritsch, R., Heymann, J. B., Kalko, S., Pittet, J. J., Rodriquez-Tome´, P., and Boudier, T. (1999) Organising multidimensional biological image information: The BioImage Database, Nucleic Acids Res. 27, 280–183. Lindek, S., Fritsch, R., Machtynger, J., de Alarco´n, P. A., and Chagoyen, M. (1999) Design and realization of an on-line database for multidimensional microscopic images of biological specimens, J. Struct. Biol. 125, 103–111. Marabini, R., Vaquerizo, C., Ferna´ndez, J. J., Carazo, J. M., Engel, A., and Frank, J. (1996) Proposal for a new distributed data base of macromolecular and subcellular structures from different areas of microscopy, J. Struct. Biol. 116, 161–167. Pittet, J.-J., Henn, C., Engel, A., and Heymann, J. B. (1999) Visualizing 3D data obtained from microscopy on the internet, J. Struct. Biol. 125, 123–132. Salmon, N. J., Lindek, S., and Stelzer, E. H. K. (1999) Databases for microscopes and microscopical images, in Ja¨hne, B., Haußecker, P., and Geissler, P. (Ed.), Handbook on Computer Vision and Applications, Academic Press, Boston, in press. Stoesser, G., Moseley, M. A., Sleep, J., McGowran, M., GarciaPastor, M., and Sterk, P. (1998) The EMBL Nucleotide Sequence Database, Nucleic Acids Res. 26, 8–15.