Information Management for Material Science Applications in a Virtual Laboratory

A. Frenkel1, H. Afsarmanesh1, G. Eijkel2, and L.O. Hertzberger1

1 University of Amsterdam, Computer Science Department, Kruislaan 403, 1098 SJ, Amsterdam, The Netherlands
{annef, hamideh, bob}@science.uva.nl
2 Institute for Atomic and Molecular Physics (AMOLF), Kruislaan 407, 1098 SJ, Amsterdam, The Netherlands
[email protected]
Abstract. The goal of the Virtual Laboratory (VL) project, being developed at the University of Amsterdam, is to provide an open and flexible infrastructure that supports scientists collaborating on a joint experiment. The advanced features of the VL provide an ideal environment for experiment-based applications, such as the Material Analysis of Complex Surfaces (MACS) experiments, which benefit from the interfaces developed to the hardware and software that the scientists require. To properly support information management in this collaborative environment, a set of innovative and specific mechanisms and functionalities for efficient storage, handling, integration, and retrieval of the MACS-related data, as well as data analysis tools for the experiment results, are being developed. This paper focuses on information management in the MACS application case and describes its implementation using the Matisse ODBMS.
1 Introduction

The aim of the Virtual Laboratory (VL) project1 is to provide an open and flexible framework that supports the collaboration between groups of scientists, engineers and scientific organizations that decide to share their knowledge, skills and resources (e.g. data, software, hardware, complex devices, etc.) towards the achievement of a joint experiment [1], [2], [11]. The advanced features of the VL provide an ideal environment for experiment-based applications, which benefit from the interfaces developed to the hardware and software that the scientists require. One of the experiment-based application cases proposed for the VL focuses on the Material Analysis of Complex Surfaces (MACS) experiments. These experiments involve large and complex physics-related devices, such as the Fourier Transform Infra-Red imaging spectrometer (FTIR) and the nuclear microprobe (mBeam). This application case benefits from the VL in two ways: the devices can be operated remotely by multiple collaborating users, and results from different experiments can be combined, which creates new research opportunities.

In order to support the information management involved in this collaborative environment, a set of innovative and specific mechanisms and functionalities for efficient storage, handling, integration, and retrieval of the MACS-related data, through the VL, are being developed. These mechanisms and functionalities enable scientists to search through the large amount of stored data in order to identify patterns and similarities. Therefore, the database model is carefully designed to enable an efficient way to store and access the data produced in such scientific environments. The focus of this paper is on describing the information management mechanisms and functionalities specific to the MACS case that are being implemented using the Matisse ODBMS.

This paper is organized as follows. Section 2 describes the Virtual Laboratory environment and its reference architecture. Section 3 covers the specific domain, i.e. the MACS experiment case. Section 4 presents the development approach and the functional details of the information management system developed for the MACS application using the Matisse ODBMS. Section 5 presents the main conclusions of this paper and some of the future work planned in the context of this research project.

1 This research is supported by the ICES/KIS organization.

H.C. Mayr et al. (Eds.): DEXA 2001, LNCS 2113, pp. 165–174, 2001. © Springer-Verlag Berlin Heidelberg 2001
2 The Virtual Laboratory Environment

The Virtual Laboratory environment provides a framework for groups of scientists, engineers and scientific organizations that interact and cooperate with each other towards the achievement of a common experiment. Such an experimental environment enables researchers at different locations to work interactively, as in any laboratory, i.e. the scientists are able to create and conduct the experiments in the same natural and efficient way as if they were in their own laboratory. One of the most important characteristics of the experimental domains is the manipulation of large data sets produced by the experiment devices, as described in [1]. To be able to handle the resulting experiment data sets, three main requirements are supported within the VL architecture:

- Proper management of large data sets, i.e. storage, handling, integration, and retrieval of large data sets. For example, in such a scientific environment, the size of data sets can range from a few megabytes (e.g. DNA micro-array experiment data sets) to tens of gigabytes (e.g. FTIR imaging micro-spectrometer data sets).
- Information sharing and exchange for collaboration activities: scientists are able to share both the devices used to perform the experiments and the data sets generated by those experiments. They must also be able to look at these data sets and compare them to the ones from previous experiments or from other public databases, in order to find similarities and patterns.
- Distributed resource management: resources must be managed properly in order to meet the high-performance and massive computation and storage requirements.
The Virtual Laboratory architecture, shown in Fig. 1, incorporates these and other functional requirements through the design of different system components. In particular, the VL architecture consists of three main components:

1. The Application Environment contains the scientific application domains considered in the VL (e.g. the MACS application case, the DNA Micro-array application case, and others), including certain domain-specific functionalities.
2. The VL Middleware enables the VL users to access low-level distributed computing resources. The VL middleware provides: the VL user interface, which enables the scientists to define and execute the experiments; the Abstract Machine (AM), which is the intermediate layer between the Grid infrastructure and the VL users, as described in [2]; and three main functional components. The VIMCO component provides the functionalities to store and retrieve both the large data sets and the data analysis results, the advanced functionalities for intelligent information integration, and the facilities for information sharing based on a federated approach [1]. The ComCol component provides the appropriate mechanisms for data and process handling based on Grid technology. The ViSE component offers a generic Virtual Simulation and Exploration environment in which 3D visualization techniques are offered to analyze large data sets. The functionality provided by each of these components is integrated through the VL integration architecture.
3. The Distributed Computing Environment provides the network platform that enables efficient usage of the computing and communication resources. At present, a Gigabit Ethernet connection is being used. In the near future, it will be extended to a wide-area environment using a GigaPort network based on the Surfnet5 backbone, which will result in a speed of 80 gigabits per second and a client connection capacity of 20 gigabits per second [6].
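To give a feel for what these link capacities mean for the data set sizes mentioned in this section, the following minimal sketch estimates idealized transfer times. It assumes that the nominal link rate is fully available to a single transfer and ignores protocol overhead, so real transfers would take longer. The function name and the 10-gigabyte example size are illustrative choices, not figures from the VL project.

```python
def transfer_seconds(size_gigabytes: float, link_gigabits_per_s: float) -> float:
    """Idealized transfer time: size in decimal gigabytes, link rate in Gbit/s."""
    if link_gigabits_per_s <= 0:
        raise ValueError("link rate must be positive")
    size_gigabits = size_gigabytes * 8  # 1 byte = 8 bits
    return size_gigabits / link_gigabits_per_s


# Hypothetical 10 GB data set, within the "tens of gigabytes" range of the text.
EXAMPLE_SIZE_GB = 10
for label, rate in [("Gigabit Ethernet", 1.0),
                    ("20 Gbit/s client connection", 20.0)]:
    print(f"{label}: {transfer_seconds(EXAMPLE_SIZE_GB, rate):.0f} s")
```

Under these assumptions the planned client connection shortens the transfer of such a data set by a factor of 20 compared with the current Gigabit Ethernet link.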
The Grid infrastructure provides the platform to manage data, resources, and processes in distributed collaborative environments, such as the VL scientific applications. The Globus toolkit offers a set of tools to manage the resources in Data-Grid systems [5], [2], [7], [15]. The functionalities provided by the VIMCO layer and the domain-specific tools developed in the VL Interface layer for the Material Science applications are described in detail in the following sections.

[Figure 1 diagram. Labels: End-user Application Environment (Case 1 FTIR Scanner, Case 2 Microbeam, Case 3 DNA Array, Others); VL Middleware (VL User Interface Environment; ViSE, ComCol, VIMCO; VL Integration Architecture; VL Abstract Machine); Distributed Computing Environment.]

Fig. 1. Virtual Laboratory reference architecture
3 Material Science Application in VL
The goal of the Material Science application is to study materials and their properties and to understand what happens on surfaces when materials interact. In this section, the Material Analysis of Complex Surfaces experiment, a specific case of the Material Science application, is described.

3.1 Material Analysis of Complex Surfaces Experiment

The Material Analysis of Complex Surfaces (MACS) experiments try to identify and determine the elements that compose complex surfaces, regardless of the nature of the sample. Some application areas that benefit from this kind of experiment (some of which are currently implemented or considered) include: art conservation and restoration (e.g. analysis of binding media and organic pigments in old master paintings), bio-medical science (e.g. identification of arteriosclerotic deposits in mice), medical research (e.g. studies of trace elements in brain tissues), and others.

The MACS experiment itself can be divided into three phases, as shown in Fig. 2: the preprocessing, the experimentation process, and the analysis of results. The preprocessing phase is where references to related research and images of the object are collected and analyzed. After this, the sample that will be used during the experiment process is extracted from the object. This process includes several extraction protocols and procedures to be followed. The sample then usually needs to be treated, for example with reagents and solutions, in order to fulfill the requirements of the device used in the material analysis process of the experimentation phase.

[Figure 2 diagram. Labels: Pre-processing (Collect of Related Information: External Database, Related Research; Object Extraction and Preparation of the Sample: Master Paintings, Biological Tissues, Polymer Laminates); Experimentation Process (Material Analysis Process: FTIR Device; Data File Conversion and Certification: Data Cube); Analysis Process (Data Analysis and Knowledge Extraction, Interpretation: Scientists, Analysis Tools, Results, Local Database).]

Fig. 2. Material Analysis of Complex Surfaces Experiment

The material analysis process is performed with a set of specialized and complex hardware equipment. At present, the FTIR and the mBeam devices are available. The FTIR facility is a non-dispersive infrared imaging spectrometer coupled to an infrared microscope, used to examine the infrared radiation absorbed by complex surfaces, as described in [8] and [4]. The mBeam device provides a highly focused beam of ions, with a spatial resolution in the sub-micrometer range, that can be used to identify trace elements on a surface with a sensitivity of 10^-15 grams, as also described in [8].

After the full scan process finishes, the outcome of the experiment is a set of data files containing the experiment results and the device parameters. This data set consists of a stack of images, known as a hyper-spectral data cube. Afterwards, these data files are converted into a format that can be used in the analysis phase. A quality control process is also carried out to certify that the generated data complies with certain standards; otherwise the data is discarded and the material analysis process is redone.

The large amount of data produced by these devices makes the analysis phase longer and more laborious than the experiment phase itself. For example, the size of a single data cube can range from 16 to 100 Mbytes, and considering that up to 20 data cubes can be generated every day, it is understandable that individual scientists cannot do this analysis by hand. Therefore, a set of analysis tools needs to be integrated into the application to facilitate the work of the scientists, e.g. correlation analysis, multivariate data analysis (PCA, PLS) and others.
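The data volume implied by these figures can be worked out directly. The short sketch below multiplies the quoted cube sizes by the quoted maximum number of cubes per day; the constant and function names are illustrative, and the result is an upper bound for a day on which the maximum of 20 cubes is reached.

```python
MIN_CUBE_MB = 16         # smallest data cube size quoted in the text
MAX_CUBE_MB = 100        # largest data cube size quoted in the text
MAX_CUBES_PER_DAY = 20   # maximum number of cubes per day quoted in the text


def daily_volume_mb(cubes_per_day: int, cube_mb: float) -> float:
    """Total data volume in Mbytes produced in one day of measurements."""
    if cubes_per_day < 0 or cube_mb < 0:
        raise ValueError("inputs must be non-negative")
    return cubes_per_day * cube_mb


low = daily_volume_mb(MAX_CUBES_PER_DAY, MIN_CUBE_MB)
high = daily_volume_mb(MAX_CUBES_PER_DAY, MAX_CUBE_MB)
print(f"{low:.0f} to {high:.0f} Mbytes per day at {MAX_CUBES_PER_DAY} cubes/day")
```

At the maximum rate this gives between 320 and 2000 Mbytes of new data per day, which is why the analysis phase depends on tool support.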
4 The MACS Information Management System

The main goal of this system is to provide an open and flexible environment that facilitates the experimentation process for physicists involved in MACS-related experiments. This application case is being developed by the CO-IM group [14] at the University of Amsterdam in collaboration with the physics institutes AMOLF and NIKHEF.

The first phase of building the MACS system focuses on the specific mechanisms and functionalities that need to be developed for the information management of the data produced by the FTIR and the mBeam devices. First, the information management requirements were identified, including a study of the structures of the input and output data and of the operations on the data of the application domain. The next step was the development of the MACS database, which included: the design of the database, the development of a database prototype, the design and development of tools to load the database, the population of the database with the FTIR and/or mBeam data, and the design and development of the user and query interfaces.

The second phase will focus on the development of data analysis and knowledge extraction tools that will be used to process, analyze and present the results in such a way that valuable knowledge can be extracted from the large amount of data generated by these complex devices, i.e. information about experimental resources, experimental parameters and conditions, and raw or processed results.
4.1 MACS Process-Data Model

After studying and analyzing the way in which the MACS experiments are performed (including data, objects, and processes), a process-data flow model was designed. For this design, the Virtual Laboratory Experiment Environment Data (VL-EED) model was used as a reference model [10]. The VL-EED model is a generic database model for experimentation environments. This model is the result of a careful study of several applications within the context of the VL project, which made it possible to determine the generic characteristics of scientific experiments and to design a generic schema to store experimental information. The VL-EED model is a template that facilitates the creation of new experiment-based schemas and thereby prevents the duplication of modeling effort, i.e. the database managers do not have to create a new “schema” for each new experimental application. It also enables a more efficient way to share and access the data from different experiment-based application tools, e.g. data analysis tools, browser and query tools.

The VL-EED model (shown in Fig. 3) can be viewed as a hierarchy with the class Project as the root. Under each project a number of experiments can be performed. Each experiment consists of experiment elements that can be either processes or data elements. The experiments and the experiment elements can have comments. The processes are actions that can be described by protocols (i.e. standard procedures) and can have properties. The processes may be carried out with the use of hardware or software tools, which have parameters and whose vendor is an organization. In addition, a person who belongs to an organization (both with an address) performs the experiments and processes.
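The hierarchy described above can be sketched as a handful of plain classes. This is only an illustration of the structure in a general-purpose language, not the VL-EED schema itself: the attribute names are simplified, most attributes and classes of the model (tools, parameters, properties, addresses) are left out, and the sample values in the usage lines are made up.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Organization:
    name: str


@dataclass
class Person:
    name: str
    employee_of: Optional[Organization] = None


@dataclass
class Comment:
    comment: str
    creator: Person


@dataclass
class ExpElement:
    """Base class for everything an experiment consists of."""
    name: str
    comments: List[Comment] = field(default_factory=list)


@dataclass
class DataElement(ExpElement):
    """A data item that takes part in an experiment."""


@dataclass
class Process(ExpElement):
    """An action, optionally described by a protocol and performed by persons."""
    protocol: Optional[str] = None
    performed_by: List[Person] = field(default_factory=list)


@dataclass
class Experiment:
    name: str
    elements: List[ExpElement] = field(default_factory=list)
    comments: List[Comment] = field(default_factory=list)


@dataclass
class Project:
    """Root of the hierarchy: a project groups experiments."""
    name: str
    experiments: List[Experiment] = field(default_factory=list)


# Usage with made-up values.
amolf = Organization("AMOLF")
scientist = Person("A. Scientist", employee_of=amolf)  # hypothetical person
scan = Process("Material Analysis", protocol="FTIR scan", performed_by=[scientist])
cube = DataElement("Data Cube")
experiment = Experiment("Paint sample analysis", elements=[scan, cube])
project = Project("MACS", experiments=[experiment])

process_names = [e.name for e in experiment.elements if isinstance(e, Process)]
print(process_names)
```

Because processes and data elements share one base class, an experiment can hold both in a single list, which mirrors the way the model treats them uniformly as experiment elements.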
The relationships between experiments and experiment elements are represented by the recursive relations has_prev_elm and has_next_elm. The goal of this representation is to enable a flexible, arbitrary process-data flow.

[Figure 3 diagram: UML class diagram. Classes: PROJECT, EXPERIMENT, EXP_ELEMENT, PROCESS, DATA_ELEMENT, COMMENT, PROTOCOL, PROPERTY, TEMPLATE, PERSON, ORGANIZATION, ADDRESS, HARDWARE, SOFTWARE, HW_TOOL, SW_TOOL, PARAMETER. Relationships: has_exp, project_of, experiment_in, has_prev_exp, has_next_exp, has_related_exp, has_element, element_of, has_sub_elm, has_super_elm, has_prev_elm, has_next_elm, has_comment, has_property, has_protocol, has_submitter, has_submitted, has_contributor, has_contributed, has_performed, performed_by, has_defined, defined_by, has_project, has_address, has_employee, employee_of, has_vendor, has_hardware, has_software, has_parameter.]

Fig. 3. Virtual Laboratory Experiment Environment Data model

The MACS process-data flow model (shown in Fig. 4) covers the information specific to the material science experiments. Because the VL-EED model is flexible and extensible, it was easy to develop the domain-specific data model on top of it. Following the VL-EED definition, the MACS experiments consist of experiment elements that can be extended to data elements and/or processes. The data elements can be subdivided into active elements and passive elements according to their participation in the different experimental phases: the passive elements are only used during the experiment process, while the active elements are generated and/or modified by one or more experimental processes. In the figure, the Passive data elements are represented by gray rectangles (e.g. Object, Physics Devices, Analysis Tool, etc.), the Active data elements by lined rectangles (e.g. Sample, Data Cube, etc.), and the Process elements by ovals (e.g. Sample Extraction, Material Analysis, Data Cube Analysis, etc.).

4.2 MACS Information Management System Development

The MACS database system was developed using the Matisse object-oriented database management system, which provides a set of database management tools for proper handling of large and complex data from database applications. Some of the advantages of Matisse ODBMS for this application include its flexible and dynamic data model, its support for many multimedia data types, and the high level of scalability and reliability that it provides, as mentioned in [12]. In order to create the description of the MACS schema in Matisse, the data definition language MATISSE ODL was used.
The MACS ODL file provides the description of the persistent data for both the VL-EED and the MACS schemas as a set of object classes, including their attributes and relationships. Once the MACS ODL file is ready, the next step is to interpret it using the MATISSE mt_odl utility, which creates the actual MACS schema in the database. Thus the MACS database schema is stored in the database and can be manipulated like other objects through the use of APIs.

Fig. 4. MACS Process Data Flow

Once the database is set up, data can be transferred from existing external sources with the loader tool developed specially for this purpose. The MACS Database Loader is responsible for providing the proper means for uploading data into the MACS database, so that instead of creating one object at a time it is possible to load many objects at once. The format of the source data file is based on the Object Interchange Format (OIF). This format is a specification language proposed in the ODMG standard to dump/load database objects to/from files, as described in [13]. The MACS Database Loader (presented in Fig. 4) was implemented in Java, so that the application is portable between platforms and can also be used as an applet, allowing it to run remotely from a web browser.

For the integration with the MACS database, the Matisse Java API was used [9]. The Matisse Java API, developed at the University of Amsterdam, is a set of library functions that provides high-level, object-oriented Java access to the Matisse ODBMS. It provides a set of generic data management functions that encapsulate Matisse C API commands. In this way, the applications that are developed do not have to deal with Matisse specificities, and only need to provide the necessary information through the access functions. These functions do not necessarily imply a one-to-one mapping to Matisse commands; they can encapsulate a sequence of Matisse commands. The functions contained in this library include: the DB Access functions (e.g. to connect and perform transactions on the Matisse DB), the Data Access functions (e.g. to select, update and delete the data in the Matisse DB), and the Meta-data Access functions (e.g.
to perform operations on the database schema).

4.3 The MACS Information Management System in the Virtual Laboratory

Taking as a scenario the experiment for the analysis of highly oxidised diterpenoid acids in Old Master paintings, described in detail in [3], a typical experiment developed within the VL environment would consist of the following steps:

1. Through the VL user interface environment (of the VL middleware), the user logs in to the system, and through a VL web-based interface he/she is able to access the VL resources, which include physical devices, software and data elements.
2. Using further features of the VL Abstract Machine, the experiment is defined by selecting a number of experiment elements, i.e. processes and data elements, and connecting them in order to create a process-data flow. The definition of the experiment is performed using a drag-and-drop interface, which may also provide an intelligent assistant (i.e. the VL-AM Assistant) to help the user during the design of the VL experiment, as described in [2]. It is also possible to load a previous experiment, i.e. an experiment that was performed earlier, or even a pre-defined experiment (i.e. an experiment template).
3. Every application provides a set of user-friendly tools, either domain-specific tools or generic tools, to look at the data sets stored in VIMCO. Through the MACS user interface facilities, the user can access, at any time, the data collected from the experiments. In this case, the MACS interface allows the user to perform queries on the MACS database, to apply analysis processes in order to extract valuable information, and to access visualization tools.
4. When the setup of the experiment is finished, the experiment is submitted to the system. At this moment, the VL Abstract Machine Run Time System (VL-AM RTS) uses the tools provided by the Globus toolkit for Data-Grid management to send the different parts of the experiment throughout the distributed environment (within the computational grid), according to the computational requirements and the availability of the resources needed.
5. During the execution of the experiment, through the VL user interface environment of the VL middleware, the user is able to supervise the experiment using monitoring tools. It is also possible for the user to change the experiment parameters at any moment in order to adjust the experiment process.

Fig. 4. MACS Database Loader user interface
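Step 2 of this scenario, in which experiment elements are connected into a process-data flow, can be sketched as follows. The has_next_elm and has_prev_elm attributes mirror the relation names of the VL-EED model, but the class and function names are hypothetical and this is not the interface of the VL Abstract Machine. The chain of elements is an illustrative reading of the phases in Section 3.1.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)  # identity-based equality, since elements link to each other
class Element:
    name: str
    kind: str  # "process" or "data"
    has_next_elm: List["Element"] = field(default_factory=list)
    has_prev_elm: List["Element"] = field(default_factory=list)


def connect(source: Element, target: Element) -> None:
    """Link two elements in both directions so the relations stay inverse."""
    source.has_next_elm.append(target)
    target.has_prev_elm.append(source)


def flow_order(start: Element) -> List[str]:
    """Breadth-first walk along has_next_elm links from a starting element."""
    order: List[str] = []
    queue = [start]
    seen = {id(start)}
    while queue:
        current = queue.pop(0)
        order.append(current.name)
        for nxt in current.has_next_elm:
            if id(nxt) not in seen:
                seen.add(id(nxt))
                queue.append(nxt)
    return order


# Element names taken from the MACS process-data flow description.
obj = Element("Object", "data")
extraction = Element("Sample Extraction", "process")
sample = Element("Sample", "data")
analysis = Element("Material Analysis", "process")
cube = Element("Data Cube", "data")
cube_analysis = Element("Data Cube Analysis", "process")

for a, b in [(obj, extraction), (extraction, sample), (sample, analysis),
             (analysis, cube), (cube, cube_analysis)]:
    connect(a, b)

print(" -> ".join(flow_order(obj)))
```

Keeping both link directions lets a tool walk the flow forwards, to schedule what runs next, or backwards, to trace which processes and inputs produced a given data element.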
5 Conclusions and Future Work

In the VL environment, an important requirement is the appropriate management of the large amount of data produced by the large and complex devices used in the scientific experiments. The information management system developed for the Material Science application in the VL project, and its implementation using Matisse ODBMS, supports the efficient storage, handling, integration, and retrieval of such data sets. The MACS component, integrated in the VL environment, provides a comprehensive and friendly environment to scientists of the Material Science application.

The user-friendly interfaces that allow the VL users to access the data stored in the Matisse database are now under development. Such a query/search component will enable the VL user to search through the data and look for similarities or patterns. A query component that includes sophisticated search commands is being considered and will result in a more powerful tool. For instance, these query tools can be used to extract slices from the data cubes and, together with specialized tools, perform calculations (e.g. chemometrics, correlation analysis methods) on these data slices.

Additionally, data mining and knowledge extraction technology should be offered to analyze the large data sets, processing either the raw data generated by different devices from different applications or the processed experiment results. This technology is presently being considered to process, analyze and present the results in such a way that valuable knowledge can be extracted from the large amount of data. The stored data that will be used may include information about experimental resources, experimental parameters and conditions, and raw or processed results. The easy retrieval and manipulation of the large data sets, together with sophisticated data analysis and knowledge extraction tools, give the scientists new research possibilities.
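The slice-based calculations mentioned above can be illustrated with a minimal sketch. It represents a data cube as a stack of images held in nested lists, extracts two slices, and computes their Pearson correlation. The in-memory layout, the function names and the tiny cube are assumptions made for illustration; they do not describe how the MACS database stores or queries cubes.

```python
from math import sqrt
from typing import List

Image = List[List[float]]  # one spectral slice: rows of pixel values
Cube = List[Image]         # hyper-spectral data cube: a stack of images


def extract_slice(cube: Cube, band: int) -> Image:
    """Return the image stored at one band index of the cube."""
    return cube[band]


def pixel_spectrum(cube: Cube, row: int, col: int) -> List[float]:
    """Return the values of one pixel across all bands."""
    return [image[row][col] for image in cube]


def flatten(image: Image) -> List[float]:
    """Turn a 2D image into a flat list of pixel values, row by row."""
    return [value for row in image for value in row]


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    denom = sqrt(sum(d * d for d in dx) * sum(d * d for d in dy))
    if denom == 0:
        raise ValueError("correlation is undefined for constant input")
    return sum(a * b for a, b in zip(dx, dy)) / denom


# Tiny made-up cube: 3 bands of 2x2 pixels (values are illustrative only).
cube = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[2.0, 4.0], [6.0, 8.0]],  # band 1 is band 0 scaled by 2
    [[4.0, 3.0], [2.0, 1.0]],  # band 2 decreases where band 0 increases
]
r01 = pearson(flatten(extract_slice(cube, 0)), flatten(extract_slice(cube, 1)))
r02 = pearson(flatten(extract_slice(cube, 0)), flatten(extract_slice(cube, 2)))
print(round(r01, 6), round(r02, 6))
```

In this toy example the first pair of slices is perfectly correlated and the second pair perfectly anti-correlated. For cubes of the sizes quoted in Section 3.1, an array library would be the practical choice over nested lists.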
References

[1] Afsarmanesh, H., Benabdelkader, A., Kaletas, E.C., et al. Towards a Multi-layer Architecture for Scientific Virtual Laboratories. In 8th International Conference on High Performance Computing and Networking - EuropeHPCN 2000. 2000. Amsterdam, The Netherlands: Springer.
[2] Belloum, A., Hendrikse, Z.W., Groep, D.L., et al. The VL Abstract Machine: a Data and Process Handling System on the Grid. In High Performance Computing and Networking Europe, HPCN 2001. 2001. Amsterdam, The Netherlands.
[3] Berg, K.J.v.d., Boon, J.J., Pastorova, I., et al., Mass spectrometric methodology for the analysis of highly oxidized diterpenoid acids in Old Master paintings. Journal of Mass Spectrometry, 2000. 35(4): p. 512-533.
[4] Eijkel, G.B., Afsarmanesh, H., Groep, D., et al. Mass Spectrometry in the Amsterdam Virtual Laboratory: development of a high-performance platform for meta-data analysis. In 13th Sanibel Conference on Mass Spectrometry: informatics and mass spectrometry. 2001. Sanibel Island, Florida, USA.
[5] Foster, I., Kesselman, C., and Tuecke, S., The Anatomy of the Grid: enabling scalable virtual organizations, www.globus.org/research/papers/anatomy.pdf. 2000.
[6] Gigaport, Gigaport Homepage (www.gigaport.nl). 2001.
[7] Global Grid Forum, http://www.gridforum.org/. 2001.
[8] Groep, D., Brand, J.v.d., Bulten, H.J., et al., Analysis of Complex Surfaces in the Virtual Laboratory. 2000, Amsterdam, The Netherlands.
[9] Kaletas, E.C., A Java Based Object-Oriented API for the Matisse OODBMS. 2001, University of Amsterdam: Amsterdam.
[10] Kaletas, E.C. and Afsarmanesh, H., Virtual Laboratory Experiment Environment Data model. 2001, University of Amsterdam: Amsterdam.
[11] Massey, K.D., Kerschberg, L., and Michaels, G. VANILLA: A Dynamic Data Schema for A Generic Scientific Database. In 9th International Conference on Scientific and Statistical Database Management (SSDBM ’97). 1997. Olympia, WA, USA: Institute of Electrical and Electronics Engineers (IEEE).
[12] Matisse, Matisse Tutorial. 1998.
[13] ODMG, The Object Data Standard: ODMG 3.0. Series in Data Management Systems, ed. Gray, J., et al. 2000: Morgan Kaufmann Publishers, Inc.
[14] The CO-IM Group, UvA, http://carol.wins.uva.nl/~netpeer/.
[15] The Globus Project, http://www.globus.org/. 2001.