IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE, VOL. 12, NO. 2, MARCH 2008
A Semantic Grid Infrastructure Enabling Integrated Access and Analysis of Multilevel Biomedical Data in Support of Postgenomic Clinical Trials on Cancer Manolis Tsiknakis, Member, IEEE, Mathias Brochhausen, Jarek Nabrzyski, Juliusz Pucacki, Stelios G. Sfakianakis, George Potamias, Cristine Desmedt, and Dimitris Kafetzopoulos
Abstract—This paper reports original results of the Advancing Clinico-Genomic Trials on Cancer integrated project, focusing on the design and development of a European biomedical grid infrastructure in support of multicentric, postgenomic clinical trials (CTs) on cancer. Postgenomic CTs use multilevel clinical and genomic data, together with advanced computational analysis and visualization tools, to test hypotheses about the molecular causes of a disease and the stratification of patients in terms of treatment. The paper presents the needs of users involved in postgenomic CTs in the form of scenarios, which drive the requirements engineering phase of the project. Subsequently, the initial architecture specified by the project is presented, and its services are classified and discussed. A key set of such services are those used for wrapping heterogeneous clinical trial management systems and other public biological databases. The main technological challenge, i.e., the design and development of semantically rich grid services, is also discussed. Achieving this objective requires extensive use of ontologies and metadata. The Master Ontology on Cancer, developed by the project, is presented, together with our approach to developing the required metadata registries, which provide semantically rich information about available data and computational services. Finally, a short discussion of the work lying ahead is included.
Index Terms—Biomedical grid, cancer, metadata, ontology, postgenomic clinical trials, semantic integration of heterogeneous biomedical databases.
Manuscript received November 3, 2006; revised April 17, 2007. This work was supported in part by the European Union cofunded Research and Technological Development (RTD) Advancing Clinico-Genomic Trials on Cancer: Open Grid Services for Improving Medical Knowledge Discovery (ACGT) Project under Grant FP6–2005-IST-026996. The work of C. Desmedt was supported by the Fonds National de la Recherche Scientifique. M. Tsiknakis is with the Foundation for Research and Technology–Hellas, Institute of Computer Science, GR-71110 Heraklion, Greece (e-mail:
[email protected]). M. Brochhausen is with the Institute of Formal Ontologies and Medical Information Science (IFOMIS), University of Saarland, 66041 Saarbrücken, Germany (e-mail:
[email protected]). J. Nabrzyski and J. Pucacki are with the Poznan Supercomputing and Networking Center, 60-967 Poznan, Poland (e-mail:
[email protected];
[email protected]). S. G. Sfakianakis, G. Potamias, and D. Kafetzopoulos are with the Foundation for Research and Technology–Hellas, Institute of Computer Science and Institute of Molecular Biology and Biotechnology, Heraklion, Greece (e-mail:
[email protected];
[email protected];
[email protected]). C. Desmedt is with the Functional Genomics and Translational Research Unit, Jules Bordet Institute, Brussels 1000, Belgium (e-mail:
[email protected]). Digital Object Identifier 10.1109/TITB.2007.903519
I. INTRODUCTION
Recent advances in research methods and technologies have resulted in an explosion of information and knowledge about cancers and their treatment. Exciting new research on the molecular mechanisms that control cell growth and differentiation has resulted in a quantum leap in our understanding of the fundamental nature of cancer cells, and has suggested valuable new approaches to cancer diagnosis and treatment. The ability to characterize and understand cancer is growing exponentially based on information from genetic and protein studies, clinical trials, and other research endeavors. The breadth and depth of the information already available in the research community at large present an enormous opportunity for improving our ability to reduce mortality from cancer, improve therapies, and meet the demanding individualization-of-care needs [1], [2].
While these opportunities exist, the lack of a common infrastructure has prevented clinical research institutions from mining and analyzing disparate data sources. As a result, very few cross-site studies and multicentric clinical trials are performed, and in most cases, it is not possible to seamlessly integrate multilevel data (from the molecular to the organ and individual levels). Moreover, clinical researchers and molecular biologists often find it hard to exploit each other’s expertise because they lack both a cooperative environment that enables the sharing of data, resources, or tools for comparing results and experiments, and a uniform platform that supports the seamless integration and analysis of disease-related data at all levels [2]. This inability to share technologies and data developed by different organizations is, therefore, severely hampering the research process.
The vision of the Advancing Clinico-Genomic Trials on Cancer (ACGT) Project (www.eu-acgt.org) [3] is to contribute to the resolution of these problems by developing a semantically rich grid infrastructure in support of multicentric, postgenomic clinical trials (CTs), thus enabling discoveries in the laboratory to be quickly transferred to the clinical management and treatment of patients (see Fig. 1). This paper presents a short background section discussing the urgent needs faced by the biomedical informatics research community; it presents the clinical trials upon which the ACGT project is based, for both gathering and eliciting requirements and also for validating the technological infrastructure designed. It continues with a presentation of the initial ACGT architecture
1089-7771/$25.00 © 2008 IEEE
Fig. 1. ACGT semantic grid infrastructure, allowing the creation of dynamic virtual organizations (VOs) and the coordinated and secure sharing of data and tools.
by defining and presenting its layers and key enabling services. Special emphasis is given to the ACGT Master Ontology on Cancer, whose central role in the ACGT architectural framework is discussed in Section VII. The issue of grid intelligence is introduced in a subsequent section, which presents the need for ontologies and rich metadata for describing and publishing grid resources so as to enable their semantic discovery and integration. Finally, Section IX discusses the status and outlook of the work that remains to be completed to realize the vision of the project.
II. BACKGROUND
Biologists and computer scientists are working on designing data structures and implementing software tools to support biomedicine in decoding the entire human genetic sequence information. However, many issues are still unsolved. Among the most critical of these are heterogeneous data integration and metadata definitions [4]. This need for integration is readily apparent in the case of complex, multifactorial diseases such as cancer. Cancer is a highly complex and heterogeneous disease, which involves a succession of genetic changes that eventually result in the conversion of normal cells into cancerous ones. A complete understanding of these processes requires the integration and analysis of the massive amounts of data being collected from current genomic, proteomic, and metabolomic platforms. However, it is not only the multiplicity of the factors (and cellular levels) contributing to a particular disease that calls for a systematic approach to the problem. Even for Mendelian genetic disorders, nearly all of which have now been correlated with a specific gene
or set of genes, the relationship between genotype and phenotype is not as simple as expected (and/or currently treated) [5]. Because our knowledge of this domain is still largely rudimentary, investigations are now routinely moving from being “hypothesis-driven” to being “data-driven,” with analysis based on a search for biologically relevant patterns. These recent technological advances have created enormous opportunities for accelerating the pace of science. One can now envision the possibility of obtaining a comprehensive picture of the mechanisms underlying cellular function, its regulation, and the interactions of an organism with its environment [6]. In this context, exploratory analysis is the process of generating hypotheses that are later supported (or not) by the data (e.g., hypothesis: gene x is responsible for a side effect of drug y). These hypotheses are validated by means of clinical trials.
III. POSTGENOMIC CLINICAL TRIALS
As stated earlier, the task of validating clinical hypotheses is done by means of clinical trials. The most commonly performed clinical trials evaluate new drugs, medical devices, biologics, or other interventions in patients in strictly scientifically controlled settings, and are required for regulatory authority approval of new therapies. Trials may be designed to assess the safety and efficacy of an experimental therapy, to assess whether the new intervention is better than the standard therapy, or to compare the efficacy of two standard or marketed interventions. The study design that provides the most compelling evidence of a causal relationship between the treatment and the effect is the randomized controlled trial. The number of patients enrolled in the study also has a large bearing on the ability of the trial to reliably detect an effect of
TSIKNAKIS et al.: SEMANTIC GRID INFRASTRUCTURE ENABLING INTEGRATED ACCESS AND ANALYSIS
Fig. 2. Pharmaceutical R&D process and key bottlenecks (see footnote 1).
a treatment. This is described as the “power” of the trial. It is usually expressed as the probability that, if the treatments differ in their effect on the outcome of interest, the statistical analysis of the trial data will detect that difference. The larger the sample size (the number of participants), the greater the statistical power. Patient recruitment is often the time-limiting factor for clinical trials, which is why technological support for multicentric CTs is of paramount importance. Many CTs run into problems because they cannot gather enough information to draw sound conclusions in a timely manner. This applies not only to the number of patients but also to the lack of links between patients’ clinical and genomic data, i.e., the integration of multilevel clinico-genomic data, as reported and analyzed in the reference of Fig. 2.
A. ACGT Trials
The ACGT project has been structured within such a context. It has selected two cancer domains and has defined three specific trials. These trials serve a dual purpose. First, they are used for developing a range of postgenomic analytical scenarios that feed the requirements analysis and elicitation phase of the project; second, they will be used for validating the functionality of the ACGT technologies. The ACGT trials are in the domains of breast cancer and Wilms’ tumor (pediatric nephroblastoma). Specifically:
Breast cancer. It is the most common cancer in women worldwide, in both industrialized and developing countries. Breast cancer is both genetically and histopathologically heterogeneous, and the mechanisms underlying breast cancer development remain largely unknown. Much progress has been made over the past decades in our understanding of the epidemiology, clinical course, and basic biology of breast cancer [7]–[10].
Also, several independent groups have conducted comprehensive gene expression profiling studies in the hope of improving upon the traditional prognostic markers used in clinical practice [11]–[13], as risk stratification based on existing guidelines is far from perfect.
1 The Innovative Medicines Initiative (IMI) strategic research agenda. [Online]. Available: http://ec.europa.eu/research/fp6/pdf/innovative medicines sra final draft en.pdf
The ACGT Test of Principle (TOP) study aims to identify biological markers associated with pathological complete response to anthracycline therapy (epirubicin), one of the most active drugs used in breast cancer treatment. To this end, the neoadjuvant approach is very attractive, as it provides an in vivo assessment of treatment sensitivity without adversely affecting survival [14]–[16]. Supported by in vitro and preliminary in vivo data, this study is designed to prospectively test the value of topo II α gene amplification and protein overexpression in predicting the efficacy of anthracyclines. The study tests two main biological hypotheses, one for each of two patient subgroups:
1) Patients with estrogen receptor (ER) negative/tyrosine kinase-type cell surface receptor HER-2 (HER-2) amplified tumors. In this subgroup, the topo II α gene will be amplified in about 40% of cases. We hypothesize that topo II α amplified tumors will show a threefold increase in the pathological complete response (pCR) rate compared with tumors in which the topo II α gene is normal or deleted.
2) Patients with ER negative/HER-2 nonamplified tumors. In this subgroup, almost no topo II α gene aberrations will be found, based on previous data discussed in [8].
Wilms’ tumor. Although rare, it is the most common primary renal malignancy in children, and is associated with a number of congenital anomalies and documented syndromes. Appropriate laboratory, radiologic, and pathologic investigations are necessary for accurate diagnosis and subsequent staging, information that is essential for generating a multidisciplinary treatment plan utilizing surgery, chemotherapy, and radiotherapy.
The goal of the current clinical trial is to reduce therapy for children with low-risk tumors, thereby avoiding acute and long-term toxicities. Challenges remain in identifying novel molecular, histological, and clinical risk factors for stratification of treatment intensity. This could allow a safe reduction in therapy for patients known to have an excellent chance of cure with the current therapy, while identifying, at diagnosis, the minority of children at risk of relapse, who will require more aggressive treatment [17].
In silico modeling and simulation. The aim of the third ACGT study is to provide clinicians with a decision-support tool that is able to simulate, within defined reliability limits, the response of a solid tumor to therapeutic interventions based on the individual patient’s multilevel data. This in silico oncology study builds on the other two clinical trials described previously. Its objective is to develop, clinically adapt, optimize, and validate a computational system, denoted by the specially coined term “oncosimulator,” which will serve as a simulation model of tumor response to chemotherapy. The most critical biological phenomena (e.g., metabolism, cell cycling, geometrical growth or shrinkage of the tumor, cell survival following irradiation or chemotherapeutic treatment, necrosis, apoptosis, etc.) will thus be spatiotemporally simulated by using a variety of clinical,
radiobiological, pharmacodynamic, molecular, and imaging data [18].
B. Main Challenges: Integration and Analysis of Distributed Multilevel Data
The main challenge in any multicentric, postgenomic CT, once the challenges of patient recruitment in accordance with the CT eligibility criteria have been met, is the management of multilevel biomedical data and its integrated access and analysis. Typical examples are provided by the ACGT trials. Apart from the clinical data for each trial participant, which are captured in case report forms (CRFs) and managed in trial-specific databases, several types of additional datasets are also required. The number and type of such datasets depend on the objectives and the design of each trial. Taking the TOP trial as an example, a range of datasets is generated at various time intervals (defined at the study design phase), such as clinical-radiological, physical examination, bilateral breast ultrasonography (US), mammography, hematochemistry survey, electrocardiogram (ECG), chest X-ray, bone scan, and liver ultrasound. In addition, gene expression profiling and a range of immunohistochemistry examinations will be performed for the identification of specific markers (e.g., ER, HER2, topoII). Gene sequencing is also required, as well as proteomics analysis (on plasma samples) and metabolomics analysis (for pharmacokinetics).
The results from these analyses reside in separate dedicated databases (CT database, histopathological database, institutional or modality-specific image management systems, microarray database, proteomic information systems, etc.). In addition, since several clinical research organizations participate in a given trial, these databases are also geographically distributed within or across countries. Once datasets are generated, a range of specialized tools is required for their integrated access, processing, analysis, and visualization.
Responding to these requirements by providing the required integration platform is fundamentally the challenge for ACGT.
C. Analytical Scenarios to be Supported by an Integrated Technological Platform
Having defined the clinical studies to be implemented through the use of the ACGT grid infrastructure, we then proceeded to capture the functional requirements for such an infrastructure. To this end, a range of scenarios has been developed by the ACGT user community, along with several “technology-driven” scenarios, with the purpose of eliciting requirements and guiding specifications. The most complex of these scenarios is presented next. It describes the “analytical requirements” of a researcher testing a hypothesis to explain the characteristics of nonresponding patients who were withdrawn from the TOP trial due to adverse reactions to treatment. For the realization of the required analytical tasks, users need to be supported by the platform in executing the following
steps, which constitute the “analytical scenario” or the “scientific workflow.”
Step 1) Query the distributed and heterogeneous clinical trial databases of all cancer centers participating in the trial to identify the TOP trial patient cases with inflammatory breast cancer that show less than 50% tumor regression and chromosomal amplification in region 11q, and that received less than one epirubicin cycle due to a serious adverse event (allergy);
Step 2) Exclude those patient cases that show polymorphisms in UGT2B7, the specific glucuronidating enzyme of epirubicin;
Step 3) Query the corresponding genomic databases (microarray data) for the preoperative and postoperative gene expression data of these patients;
Step 4) Normalize the retrieved data, from all genomic databases participating in the trial, by using a selected transformation method;
Step 5) Compare the preoperative and postoperative data to identify differentially expressed genes;
Step 6) Cluster the identified genes by using an appropriate hierarchical clustering method and tool;
Step 7) Visualize the 50 most overexpressed and underexpressed genes;
Step 8) Obtain functional annotation for each of these genes from the Gene Ontology (GO) and GenBank public databases;
Step 9) Identify those genes expressed in B-lymphocytes from public biomedical databases;
Step 10) Map those genes into regulatory pathways by using a selected visualization tool;
Step 11) Finally, get the literature related to kinases present in pathway A and pathway B, and identify their regulatory factors.
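To make the data flow of this scenario more concrete, the following is a minimal, illustrative sketch of Steps 1, 2, 4, 5, and 7 on in-memory toy data. It is not the ACGT implementation: the `PatientCase` record, its field names, the log2(x + 1) transformation, and the gene names are all assumptions introduced here for illustration, and the distributed database queries (Steps 1 and 3) and the clustering of Step 6 are deliberately left out.

```python
import math
from dataclasses import dataclass


@dataclass
class PatientCase:
    """Hypothetical, simplified view of one TOP trial record."""
    patient_id: str
    inflammatory: bool
    regression_pct: float
    amp_11q: bool
    epirubicin_cycles: float
    ugt2b7_polymorphism: bool


def select_cohort(cases):
    """Steps 1-2: apply the inclusion criteria, then exclude UGT2B7 polymorphisms."""
    return [
        c for c in cases
        if c.inflammatory and c.regression_pct < 50 and c.amp_11q
        and c.epirubicin_cycles < 1 and not c.ugt2b7_polymorphism
    ]


def log2_normalize(profile):
    """Step 4: log2(x + 1) transform (one possible 'selected transformation')."""
    return {gene: math.log2(value + 1.0) for gene, value in profile.items()}


def mean_log_change(pre_profiles, post_profiles):
    """Step 5: per-gene mean difference (post minus pre) of normalized values."""
    genes = set.intersection(*(set(p) for p in pre_profiles + post_profiles))
    changes = {}
    for gene in genes:
        pre = sum(p[gene] for p in pre_profiles) / len(pre_profiles)
        post = sum(p[gene] for p in post_profiles) / len(post_profiles)
        changes[gene] = post - pre
    return changes


def top_genes(changes, n=50):
    """Step 7: the n most over- and underexpressed genes, ready for plotting."""
    ranked = sorted(changes.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n], ranked[-n:][::-1]


# Toy data standing in for the results of the distributed queries.
cases = [
    PatientCase("P1", True, 30.0, True, 0.5, False),
    PatientCase("P2", True, 30.0, True, 0.5, True),    # excluded in Step 2
    PatientCase("P3", False, 10.0, True, 0.0, False),  # fails Step 1
]
pre = {"P1": {"GENE_A": 7.0, "GENE_B": 15.0, "GENE_C": 3.0}}
post = {"P1": {"GENE_A": 31.0, "GENE_B": 3.0, "GENE_C": 3.0}}

cohort = select_cohort(cases)
pre_norm = [log2_normalize(pre[c.patient_id]) for c in cohort]
post_norm = [log2_normalize(post[c.patient_id]) for c in cohort]
changes = mean_log_change(pre_norm, post_norm)
over, under = top_genes(changes, n=1)
print("cohort:", [c.patient_id for c in cohort])
print("most overexpressed:", over, "most underexpressed:", under)
```

In the scenario itself, each of these functions would correspond to a separate data access or analytical service composed into a workflow, rather than to local function calls.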
IV. GENERIC REQUIREMENTS FOR THE ACGT MIDDLEWARE
From the preceding description of the type and range of user requirements, it is apparent that a truly complex technical infrastructure needs to be developed, if support for such integrated access, analysis, and visualization of multilevel heterogeneous data is to be provided in the context of discovery-driven exploratory analysis. From the users’ point of view, the platform must provide relevant, simple, and intuitive access to information (search and navigation) and to services; provide precise organization of the content independent of sources; allow scientifically relevant data integration and data exchange; provide mechanisms for data capture and annotation; and provide knowledge sharing tools. In addition, it must provide a dynamically evolving set of validated data exploration, analysis, simulation, and modeling services. Finally, it must be consistent with the way community participants work and integrate smoothly into their day-to-day environment.
The primary services required for routinely supporting such complex postgenomic analytical scenarios, identified thus far, fall into the following categories.
1) Services for the creation and management of dynamic virtual organizations (VOs). A VO is composed of all those geographically distributed research centers participating in a multicentric CT. A new research organization may, at any time, join or withdraw from a clinical trial; hence the need for dynamic VO management.
2) Data access services, which are dedicated tools that provide access to trial-specific clinical, imaging, and genomic databases as well as public databases.
3) Analytical services, which are dedicated bioinformatics tools, computational analysis tools, simulation tools, visualization tools, etc., wrapped as Web services.
4) Specialized literature-mining services for capturing implicit information about complex processes, as described in the literature.
5) Services for forming and executing exploratory analyses, that is, workflow management services and provenance management services.
6) Semantic services, required for discovering appropriate services and workflows, and for managing metadata. Of paramount importance are the ACGT Master Ontology and the ACGT Mediator.
7) Security services that, in addition to the basic grid security services, provide functions such as anonymization and pseudonymization.
The main challenge of the integrated ACGT architecture is the interoperability of systems, tools, and services that are made available to the users of the ACGT environment with the ultimate goal of secure, transparent, and unobtrusive sharing of data and functionality.
V. INITIAL SYSTEM ARCHITECTURE
A software architecture is intended to describe the structure of a system in terms of computational components and their interactions, that is, patterns that guide their composition and constraints on these patterns [19].
The ACGT project, in responding to the requirements previously discussed, focuses on the semantically rich problems of dynamic resource discovery, workflow specification, and privacy-preserving distributed data mining and knowledge discovery, as well as metadata and provenance management. A detailed analysis of the scientific and functional requirements of the ACGT infrastructure was performed, together with an analysis of the current state of the art in terms of technological infrastructure, data resources, data representation and exchange standards, and ontologies. Specifically, on the one hand, the myGrid project (www.mygrid.org.uk) focuses on providing support to investigator-driven experiments in silico. In myGrid, local and public data can be computationally evaluated to ask and answer questions in biology. It is less focused on resource sharing, and instead strives to address issues related to the semantic complexity of biological data and the applications that process those data. Within its framework, it supports resource discovery and distributed queries.
myGrid is a service-based architecture whose core is Web services and Open Grid Services Architecture Data Access and Integration (OGSA-DAI; http://www.ogsadai.org.uk/). The term service-oriented architecture refers to systems structured as networks of loosely coupled communicating services [20]. On the other hand, the cancer Biomedical Informatics Grid (caBIG; https://cabig.nci.nih.gov/) focuses on the creation of a virtual community that shares resources and tackles the key issues of cyber infrastructure. Similar to myGrid, it is an open infrastructure striving to achieve computational semantic interoperability. The caBIG infrastructure is also a service-based architecture. Within its framework, it supports resource discovery and distributed queries.
Most of the currently identified scenarios in the ACGT project are focused on data access and processing of data, but there are also several scenarios involving demanding computational jobs and 3-D visualization. In order to fulfill the requirements imposed by these scenarios, a heterogeneous, scalable, and flexible environment is needed, and the following technologies, which have gained momentum in recent years, have been considered for adoption: 1) Web services technologies; 2) grid technologies; 3) semantic Web technologies. Although initially separate, these technologies are currently converging in a complementary way. From the technical point of view, the requirements identified can be met by using a distributed/federated, multilayer, service-oriented, ontology-driven architecture. The ACGT project decided to build on an open software framework based on the Web Services Resource Framework (WSRF) and the Open Grid Services Architecture (OGSA), the de facto standards in grid computing.
Building on concepts and technologies from both the grid and Web services communities, OGSA defines uniform exposed service semantics (the grid service); defines standard mechanisms for creating, naming, and discovering transient service instances; provides location transparency and multiple protocol bindings for service instances; and supports integration with underlying native platform facilities. These standards are implemented in the selected middleware, namely Globus Toolkit 4 (GT4) [21]. An overview of the layered ACGT system architecture is given in Fig. 3 and is briefly described below.
1) Common grid infrastructure layer: This layer comprises the basic “grid engine” for accessing remote resources in a grid environment. It provides a common interface to grid resources for use by higher-level services.
2) Advanced grid middleware layer: This layer comprises advanced grid services, which operate on sets of lower-level services to provide more advanced functionality. Examples of such services include the Gridge Toolkit (see Section VI-B), the OGSA-DAI (see Section VI-B.2), and other ACGT-specific services (see Section VII). The data sources, comprising data produced during experiments, data provided by public biological databases, and data coming from the clinical trial databases, are federated to
Fig. 3. ACGT layered initial architecture.
allow easy access to specific information or to data semantically correlated through ontology-based modeling of biological/biomedical databases [22].
3) Bioinformatics and knowledge discovery services layer: This layer includes all the ACGT-specific services, such as the ACGT Master Ontology (see Section VII-B), the CTs on Cancer Metadata Services, the Semantic Mediation services (see Section VII-A), and the distributed and privacy-preserving data mining and knowledge discovery services. The metadata repository about software components and data sources (i.e., software tools, databases, and data sources) contains information about specific installed resources.
4) User access layer (applications composition and enactment layer): This layer allows users to realize complex biomedical applications as compositions of basic services from the underlying layers, exploiting the resources and data provided by the research centers forming different CT VOs. Key tools at this layer are the ACGT portal; tools for browsing domain ontologies in order to search, select, and locate the resources (data access and analytical services) to be used in the composition of applications; and tools for workflow-based authoring and scheduling of distributed applications on the grid.
5) Security layer: Access rights, security (encryption), and trust building are the issues to be addressed and solved in this layer, based on system architectural and security analysis [23]. Also, since genetic data are sensitive personal data, which means that strict data protection legislation is applicable, a range of domain-specific security services, such as pseudonymization and anonymization services, is required.
VI. ACGT GRID LAYERS
As discussed earlier, we see the need for two layers of grid services, i.e., the common grid layer and the advanced grid middleware layer. Their main roles and the services they provide are briefly described in the following sections.
Fig. 4. Primary GT4 components (dashed lines represent “tech previews”).
A. Common Grid Infrastructure
A grid-based computational infrastructure couples a wide variety of geographically distributed computational resources, storage systems, data sources, databases, libraries, and computational kernels, and presents them as a unified integrated resource that can be shared by communities as they tackle common goals [24]. Grid technology has several distinguishing features. First, as a consequence of the widespread use of the Globus Toolkit in various settings, grid technology is increasingly mature. Grid technology can support virtual communities through sharing of computational and data resources. Access and identity control are fundamental components of the architecture. The technology supports deterministic queries across a distributed, common schema. Its fundamental architecture also supports stateful processes, which are important to the concept of workflows. Essentially, all major grid projects are currently built on protocols and services provided by the Globus Toolkit (http://globus.org), which enables applications to handle distributed heterogeneous computing resources as a single virtual machine.
1) Globus Toolkit 4 and Its Key Services: Globus Toolkit 4 (GT4) is based on a new core infrastructure component compliant with OGSA and is an open source implementation of the WSRF. GT4 makes extensive use of Web services mechanisms to define its interfaces and structure its components. Web services provide flexible, extensible, and widely adopted extensible markup language (XML)-based mechanisms for describing, discovering, and invoking network services; in addition, their document-oriented protocols are well suited to the loosely coupled interactions that many argue are preferable for robust distributed systems.
These mechanisms facilitate the development of service-oriented architectures, systems, and applications structured as communicating services, in which service interfaces are described, operations invoked, access secured, etc., all in uniform ways. GT4 provides a set of grid infrastructure services that
implement interfaces for managing computational, storage, and other resources [25]. GT4’s range of basic services includes security, data management, execution management, and information services, as seen in Fig. 4.
B. Advanced Grid Middleware
1) Gridge: The essence of the common grid infrastructure layer, which is tightly connected to the network infrastructure and its services, lies in its ability to provide a scalable, secure, robust, and dynamically configurable platform for advanced communication and resource sharing. Nevertheless, the complexity of accessing, invoking, and monitoring the low-level grid services needs to be hidden from nonexpert end users. To achieve this objective, we have decided to deploy and possibly extend Gridge [26]. The Gridge Toolkit is an open source software initiative aimed at helping users to deploy ready-to-use grid middleware services and create productive grid infrastructures. All Gridge Toolkit software components have been integrated and form a consistent distributed system following the same rules for interface specification, licensing, quality assurance and testing, and distribution. The Gridge Toolkit consists of the following main tools and services.
1) Grid resource management system (GRMS): GRMS is an open source metascheduling system, which allows developers to build and deploy resource management systems for large-scale distributed computing infrastructures.
2) Grid authorization service (GAS): GAS is an authorization system, which can be the standard authorization decision point for all components of a grid system.
3) Grid mobile services: This class of applications is represented by specialized middleware services for mobile users.
4) Grid data management system (DMS): Data storage, management, and access in the Gridge environment are supported by the Gridge data management suite (DMS).
5) Grid monitoring system (Mercury): This service is designed to satisfy the requirements of grid performance monitoring; it provides monitoring data represented as metrics via both pull and push access semantics.
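The distinction between pull and push access to monitoring metrics, mentioned above for the grid monitoring system, can be illustrated with a small generic sketch. This is not the Mercury API; the `MetricSource` class, its method names, and the metric name `cpu.load` are hypothetical and serve only to show the two access patterns side by side.

```python
from typing import Callable, Dict, List


class MetricSource:
    """Hypothetical metric provider offering both pull and push access."""

    def __init__(self) -> None:
        self._latest: Dict[str, float] = {}
        self._subscribers: Dict[str, List[Callable[[str, float], None]]] = {}

    def publish(self, name: str, value: float) -> None:
        """Record a new measurement and push it to every subscriber of `name`."""
        self._latest[name] = value
        for callback in self._subscribers.get(name, []):
            callback(name, value)

    def pull(self, name: str) -> float:
        """Pull semantics: the consumer asks for the latest value on demand."""
        return self._latest[name]

    def subscribe(self, name: str, callback: Callable[[str, float], None]) -> None:
        """Push semantics: the consumer registers once and is notified of updates."""
        self._subscribers.setdefault(name, []).append(callback)


source = MetricSource()
received = []
source.subscribe("cpu.load", lambda name, value: received.append((name, value)))
source.publish("cpu.load", 0.42)
source.publish("cpu.load", 0.87)
latest = source.pull("cpu.load")
print("pushed:", received, "pulled:", latest)
```

Pull access suits occasional status checks by a scheduler or portal, whereas push access avoids polling when a consumer needs to react to every change.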
2) OGSA-DAI: Of particular importance to the ACGT is the OGSA-DAI specification [27], which is an extension to OGSA designed to support data access and integration. The goal of the OGSA-DAI is to provide uniform service interfaces for data access and integration through the grid. Disparate heterogeneous data sources can be treated as a single logical resource using the OGSA-DAI interfaces. The OGSA-DAI is logically considered as a number of cooperating grid services. The three main OGSA-DAI service components are: 1) the grid data service (GDS); 2) the grid data service factory (GDSF); and 3) the service group registry (DAISGR).
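The cooperation of the three components named above can be pictured with a short sketch. The Python below is not the OGSA-DAI API: the class and method names (`ServiceGroupRegistry`, `GridDataServiceFactory`, `GridDataService`, `perform`, `create_service`) and the toy data are hypothetical. The sketch assumes that a client discovers a factory through the registry, asks it for a data service bound to one resource, and then submits a query to that service, so that two differently structured sources are reached through one interface.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, str]
Backend = Callable[[str], List[Row]]


@dataclass
class GridDataService:
    """Stand-in for a GDS: answers queries against exactly one data resource."""
    resource_name: str
    backend: Backend

    def perform(self, patient_id: str) -> List[Row]:
        return self.backend(patient_id)


@dataclass
class GridDataServiceFactory:
    """Stand-in for a GDSF: creates a GDS bound to the resource it fronts."""
    resource_name: str
    backend: Backend

    def create_service(self) -> GridDataService:
        return GridDataService(self.resource_name, self.backend)


@dataclass
class ServiceGroupRegistry:
    """Stand-in for the DAISGR: lets clients discover registered factories."""
    factories: Dict[str, GridDataServiceFactory] = field(default_factory=dict)

    def register(self, factory: GridDataServiceFactory) -> None:
        self.factories[factory.resource_name] = factory

    def find(self, resource_name: str) -> GridDataServiceFactory:
        return self.factories[resource_name]


def clinical_backend(patient_id: str) -> List[Row]:
    # Toy relational source (made-up rows).
    table = [{"patient": "P1", "diagnosis": "diagnosis A"},
             {"patient": "P2", "diagnosis": "diagnosis B"}]
    return [row for row in table if row["patient"] == patient_id]


def imaging_backend(patient_id: str) -> List[Row]:
    # Toy image archive keyed by patient (made-up identifiers).
    studies = {"P1": ["study-7"], "P2": []}
    return [{"patient": patient_id, "study": s} for s in studies.get(patient_id, [])]


def query_all(registry: ServiceGroupRegistry, names: List[str], patient_id: str) -> List[Row]:
    """Treat several registered resources as one logical resource."""
    rows: List[Row] = []
    for name in names:
        service = registry.find(name).create_service()
        rows.extend(service.perform(patient_id))
    return rows


registry = ServiceGroupRegistry()
registry.register(GridDataServiceFactory("clinical", clinical_backend))
registry.register(GridDataServiceFactory("imaging", imaging_backend))
result = query_all(registry, ["clinical", "imaging"], "P1")
```

The client code in `query_all` never inspects how a backend stores its data, which is the property the uniform service interfaces are meant to provide.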
There are a number of implementations of the OGSA-DAI for relational biomedical databases, but to the best of our knowledge, there is no implementation for accessing Digital Imaging and Communications in Medicine (DICOM)-formatted data (http://medical.nema.org/dicom/2004.html). Since access to imaging data is crucially important in most ACGT scenarios, we have concentrated our initial efforts on developing an OGSA-DAI implementation for querying a DICOM-compliant image management server. A first implementation of this data access service is available; it provides the ability to query a DICOM-compliant image server using SPARQL (http://www.w3.org/TR/rdfsparql-query/) as the query language. SPARQL, an extension of RDQL (http://www.w3.org/Submission/2004/SUBMRDQL-20040109/), has been investigated as a query language that satisfies the requirements of the mediator (see Section VII-A). The main limitation of SPARQL is that it does not support aggregation of data or, more generally, that it cannot return any derived data that are not literally in the source data set. As DICOM does not support aggregation either, SPARQL appears well suited for expressing any valid DICOM query. A typical SPARQL query is shown next; it returns the study identifier, series number, and instance number of every image belonging to the patient(s) with name “Smith.”
    PREFIX dicom:
    SELECT ?studyId ?seriesNum ?instanceNum
    WHERE {
      ?patient dicom:PatientsName "Smith" .
      ?study dicom:Patient ?patient ;
             dicom:StudyID ?studyId .
      ?series dicom:Study ?study ;
              dicom:SeriesNumber ?seriesNum .
      ?image dicom:Series ?series ;
             dicom:InstanceNumber ?instanceNum .
    }
3) OGSA-WebDB: Also of vital importance for the ACGT user community is programmatic, seamless access to public biological databases. To fulfill this requirement, the ACGT utilizes the OGSA-WebDB specification [28]. The OGSA-WebDB was developed to make existing Web database resources OGSA-ready.
A Web database is a collection of data that is searchable through a Web-accessible search interface. The search interface usually supports keyword-based search combined with Boolean operators. The OGSA-WebDB extends the OGSA-DAI such that grid clients can not only integrate and access data stored in relational and XML database management systems but can also integrate and access the large amount of data available on Web-accessible biological databases. The OGSA-WebDB consists of two main components: proxy databases and a mediator. Proxy databases are databases that are already on the grid, i.e., they can be accessed by the OGSA-DAI/grid clients. They are used to interface the web
databases. Each Web database is represented by a proxy relation within a proxy database. Therefore, any grid client wishing to access a Web database will access its proxy relation instead. The mediator accepts a query from the grid clients and transforms the query into one or more Boolean conditions that are then sent to the target Web database. When the results are returned from the Web database, the mediator inserts the results into the appropriate proxy relations of the Web database. Then, the grid clients can access the results using the proxy relations.
VII. BIOINFORMATICS AND KNOWLEDGE DISCOVERY LAYER
The bioinformatics and knowledge discovery services layer includes all ACGT-specific services that provide either analytical capabilities or support the integrated knowledge discovery process. Knowledge discovery has been described as “the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” [29]. Two core services on this layer are the mediator and the master ontology. The former offers integrated query services, while the latter provides the necessary semantic background during the integration process. We will briefly describe these in the following.
A. ACGT Mediator
The basic role of the mediator within the ACGT environment is to provide users with a powerful tool for integrated access and retrieval of data from distributed and heterogeneous database systems. The mediator acts as a service for knowledge discovery tools, and as a client for database wrappers (i.e., the OGSA-DAI and the OGSA-WebDB). While following the state of the art in the area, we have adopted an innovative approach to semantic mediation. In this approach, the user performs queries against a single “virtual” repository. This virtual repository represents the integration of several heterogeneous sources of information.
This integration process relies on a common interoperability infrastructure, based at a conceptual level on a domain ontology. In the ACGT, we apply a local-as-view (LAV) approach to schema mediation [30]. In this approach, a global schema exists beforehand. The local schemata to be integrated are mapped to the global schema so that local schema elements are completely expressed in terms of the global schema. This requires a global schema powerful enough to cover the semantics of the local schemata. The approach works well if the local semantics are predictable and as long as amendments to the global schema can be made in an upward-compatible way. The advantage of this approach is the tight integration it achieves and the powerful reasoning it permits on the integrated data, in particular, joins across data from all the different sources. In the ACGT, the global schema is a subset of the ACGT Master Ontology. In more detail, each source is wrapped so that the wrapper exposes the local schema to the mediator in terms of concepts of the Master Ontology, i.e., as a view of it. In other words, by virtue of the wrapper, the local source appears under a virtual schema that is a view of the Master Ontology.
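The LAV idea can be illustrated with a deliberately simplified sketch. It is not the ACGT mediator: the concept names in `GLOBAL_SCHEMA`, the class `WrappedSource`, the function `mediate`, and all rows are hypothetical, and general LAV query rewriting is considerably more involved than what is shown. The sketch only assumes that each wrapper declares which global concept each local column expresses, and that the mediator answers a query posed in global terms by picking sources whose views cover the requested concepts and joining their answers on a shared key.

```python
from typing import Dict, List, Set

# Hypothetical global schema: names standing in for Master Ontology concepts.
GLOBAL_SCHEMA: Set[str] = {"PatientId", "TumorGrade", "GeneExpression"}

Record = Dict[str, str]


class WrappedSource:
    """A local source described as a view over the global schema (LAV)."""

    def __init__(self, name: str, view: Dict[str, str], rows: List[Record]) -> None:
        # view maps each local column to the global concept it expresses.
        unknown = set(view.values()) - GLOBAL_SCHEMA
        if unknown:
            raise ValueError(f"{name}: not in global schema: {sorted(unknown)}")
        self.name = name
        self.view = view
        self.rows = rows

    def covers(self, concepts: Set[str]) -> bool:
        return concepts <= set(self.view.values())

    def answer(self, concepts: List[str]) -> List[Record]:
        """Return local rows renamed to global concepts and projected."""
        results = []
        for row in self.rows:
            renamed = {self.view[col]: val for col, val in row.items() if col in self.view}
            results.append({concept: renamed[concept] for concept in concepts})
        return results


def mediate(sources: List[WrappedSource], concepts: List[str],
            key: str = "PatientId") -> List[Record]:
    """Answer a global-schema query by joining covering sources on the key."""
    merged: Dict[str, Record] = {}
    for concept in concepts:
        if concept == key:
            continue
        source = next((s for s in sources if s.covers({key, concept})), None)
        if source is None:
            raise LookupError(f"no source exposes {concept}")
        for rec in source.answer([key, concept]):
            merged.setdefault(rec[key], {key: rec[key]})[concept] = rec[concept]
    wanted = set(concepts) | {key}
    # Keep only entities for which every requested concept was found (inner join).
    return [rec for rec in merged.values() if wanted <= set(rec)]


clinical = WrappedSource(
    "clinical", {"pid": "PatientId", "grade": "TumorGrade"},
    [{"pid": "P1", "grade": "G2"}, {"pid": "P2", "grade": "G3"}])
genomic = WrappedSource(
    "genomic", {"sample_owner": "PatientId", "expr": "GeneExpression"},
    [{"sample_owner": "P1", "expr": "high"}])

joined = mediate([clinical, genomic], ["TumorGrade", "GeneExpression"])
```

Because the query mentions only global concepts, a new source can be added by writing one more view; the query itself does not change.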
B. ACGT Master Ontology
As discussed previously, the ACGT seeks to provide complex data querying and mediation functionality for the ACGT grid infrastructure. For this task, a master ontology is required to provide the foundations for semantic data integration, as already discussed in Section III. One of the definitions of ontology most often cited in informatics reads as follows: “An ontology is a formal, explicit specification of a shared conceptualization” [31]. However, this definition cannot prevent the existence of a multitude of noninteroperable ontologies, a fact that represents one of the main issues addressed by the project with its master ontology. The aim is to provide a common domain ontology for cancer research and management in order to avoid case-by-case resolutions. In both the medical and the biological domain, a large number of ontologies, terminologies, and databases have surfaced in recent years. Among these, the Foundational Model of Anatomy (FMA) [32] and the GO [33] are of special interest to us, since they provide systematizations crucial for cancer research and management at a high theoretical standard. The FMA is a computer-based knowledge source for bioinformatics, which is concerned with the symbolic modeling of the structure of the human body in a form that is understandable to humans and is also navigable by machine-based systems. It presents a domain ontology for human anatomy. GO addresses the need for consistent descriptions of gene products in different databases. The GO project has developed three ontologies that describe gene products in terms of their associated biological processes, cellular components, and molecular functions in a species-independent manner. Other existing ontologies or taxonomies are the Systematized Nomenclature of Medicine, Clinical Terms (Snomed CT, http://www.snomed.org/snomedct/) and the National Cancer Institute (NCI) Thesaurus (http://nciterms.nci.nih.gov/NCIBrowser/Dictionary.de), which provides a large vocabulary of cancer, including cancer-related diseases. The latter represents, in our view, the most consistent existing terminology in the cancer domain. Besides these systems, a number of special ontologies or vocabularies including cancer-related entities and domains exist, which cannot be described here extensively. However, recent research has shown that most of the existing systems have flaws and are logically or ontologically unsound [34]. Even before the ACGT project started, the Institute of Formal Ontology and Medical Information Science (IFOMIS) was active in developing ontological solutions within the cancer domain. These efforts resulted in an ontology of colon carcinoma. In the process of this research, a reference ontology was developed that integrated domain ontologies from anatomy, physiology, and pathology. This reference ontology is called the Ontology of Biomedical Reality (OBR) [35]. It demonstrates that integration is needed not only between medicine and other sciences but also within the medical domain itself. However useful this system might be, it cannot be the basis for the ACGT Master Ontology for the following reasons. The ACGT Master Ontology is an ontology of cancer research and management with the objective of enabling semantic data integration. The ontology is being built from a realist point of view and hence will not deal with concepts but rather with universals or classes. The method employed to create the ontology is the application of philosophical and logical principles. We are aiming to create a common reference that is both human-understandable and machine-understandable. To achieve this, it is essential to integrate state-of-the-art knowledge from the medical domain, which will be provided by the clinical and biomedical institutions in the ACGT consortium. Therefore, the ACGT Master Ontology is being developed to include formal is-a relations, value restrictions, and general logical constraints and relations. As a result, the ACGT Master Ontology covers all criteria of ontologies, as presented in [36].
Fig. 5. The Basic Formal Ontology (BFO) system (taken from http://www.ifomis.unisaarland.de/bfo/gfx/bfo-left-right.png).
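Two of the ingredients mentioned above, formal is-a relations and value restrictions, can be made concrete with a small sketch. The class names and the relation `located_in` are hypothetical and are not taken from the ACGT Master Ontology; the code assumes a single-parent hierarchy and restrictions that are inherited by subclasses, which is a simplification of what an OWL-DL reasoner does.

```python
from typing import Dict, Optional, Tuple


class TinyOntology:
    """Toy class hierarchy with is-a links and inherited value restrictions."""

    def __init__(self) -> None:
        self.parent: Dict[str, Optional[str]] = {}
        # (class, relation) -> class that every value of the relation must belong to
        self.restrictions: Dict[Tuple[str, str], str] = {}

    def add_class(self, name: str, is_a: Optional[str] = None) -> None:
        if is_a is not None and is_a not in self.parent:
            raise KeyError(f"unknown parent class: {is_a}")
        self.parent[name] = is_a

    def is_a(self, cls: str, ancestor: str) -> bool:
        """Transitive subsumption: walk the parent chain up to the root."""
        node: Optional[str] = cls
        while node is not None:
            if node == ancestor:
                return True
            node = self.parent[node]
        return False

    def restrict(self, cls: str, relation: str, value_cls: str) -> None:
        self.restrictions[(cls, relation)] = value_cls

    def allows(self, cls: str, relation: str, value_cls: str) -> bool:
        """Check value_cls against every restriction cls inherits for relation."""
        node: Optional[str] = cls
        while node is not None:
            required = self.restrictions.get((node, relation))
            if required is not None and not self.is_a(value_cls, required):
                return False
            node = self.parent[node]
        return True


onto = TinyOntology()
onto.add_class("Entity")
onto.add_class("AnatomicalStructure", is_a="Entity")
onto.add_class("Organ", is_a="AnatomicalStructure")
onto.add_class("Neoplasm", is_a="Entity")
onto.add_class("Carcinoma", is_a="Neoplasm")
onto.add_class("Consent", is_a="Entity")
# Hypothetical restriction: a neoplasm can only be located in an anatomical structure.
onto.restrict("Neoplasm", "located_in", "AnatomicalStructure")
```

With this toy hierarchy, `onto.allows("Carcinoma", "located_in", "Organ")` holds because the restriction on `Neoplasm` is inherited and `Organ` is an `AnatomicalStructure`, whereas `Consent` is rejected as a location.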
It is, we believe, obvious that a domain ontology for cancer research and management will inevitably contain entities from a wide range of topics, from the genetic and medical field to the administrative field (e.g., participation in a study) or the legal domain (e.g., consent). As a result, the OBR, which includes tumor pathologies, is not able to provide the top level for the ACGT approach, since the top entity of the OBR is a “biological entity.” Also, the emphasis of the OBR is far more anatomical than would be useful for the ACGT project. Nevertheless, the OBR inspired some aspects of the ACGT Master Ontology, e.g., the ontology of neoplasms. Since the OBR was not a possible solution for the top level of the ACGT Master Ontology, an alternative was sought and
found: the Basic Formal Ontology (BFO), shown in Fig. 5. The BFO has been developed at IFOMIS and at the University of Buffalo. Methodologically, the BFO is based on the following four principles: 1) realism: reality exists independently of our representations; 2) fallibilism: scientific theories can be subject to revision; 3) perspectivalism: there are plural legitimate perspectives on reality; 4) adequatism: the different perspectives are not reduced to one another. A central feature of the BFO is the basic dichotomy between continuants and occurrents, which emphasizes two distinct modes of existence in time and is sometimes referred to as the snapshot (SNAP)-SPAN ontology [37]. Continuants are entities that exist over time and are the objects of change (e.g., human being, tumor, molecule), whereas occurrents are entities existing in time and are change itself (e.g., breathing, growth, cell division). Entities that seem to be closely related from the clinical point of view might be essentially different from the ontological point of view, e.g., “Neoplasm” is a continuant, whereas “TumorStage” is an occurrent. Furthermore, the BFO is currently available in an OWL-DL implementation, which increases the possibilities of syntactical integration and reasoning. Choosing a coherent and logically consistent top level for the ontology is a highly important step. Any systematization of the world (or of any given domain) has to start with basic ideas on what entities exist, or on the criteria by which the elements of reality are categorized at a basic level. This “top down” part of ontology development is vitally important in order to arrive at common terms and principles. As we have seen, the major distinction in the BFO is based on how entities are related to time. The first step in adding entities from clinical practice to the ontology was the integration of clinical report forms (CRFs) into the system.
The CRFs represent data from the different data types in the ACGT domain, except, in most cases, molecular data. In an approach that could be called “bottom up,” the universals to which patient data refer were entered into the ontology. The development of the ACGT Master Ontology started with the trials selected, but its scope is easy to extend once the first steps in categorization have been taken. The existence of a coherent top level ensures the reusability of the ontology, since it prevents the development of ontologies based on a top level that is restricted to one specific domain. An important issue that has to be addressed by any project that relies on a reference ontology is quality management of the ontology. It is our goal and ambition to make the ACGT Master Ontology a member of the Open Biomedical Ontologies (OBO) Foundry (http://www.obofoundry.org/). The OBO Foundry is a library of interoperable reference ontologies for the biomedical sector, which all subscribe to the same quality standards. All ontologies in the foundry are open source. Among the members of the OBO Foundry are the FMA and the GO. Thus, we aim to reuse entities from these systems whenever we need them. If we are successful in becoming a member of the OBO Foundry, the ACGT Master Ontology will be held to the same quality standards as the other ontologies of the Foundry. Thus, the ACGT
contributes to the global efforts to build ontology-based health care systems and data integration for biomedicine.
VIII. BIOMEDICAL GRID INTELLIGENCE
How data at different levels of the grid can be effectively acquired, represented, exchanged, integrated, and converted into useful knowledge is the subject of an emerging research field known as “grid intelligence” [38]. The term indicates the convergence of Web service, grid, and semantic Web technologies and, in particular, the use of ontologies and metadata as basic elements through which intelligent grid services can be developed [39]. An example of this convergence is the semantic grid, which came into existence as an effort to introduce semantic Web technologies into grids and is usually defined as “an extension of the current grid, in which information and services are given well-defined meaning, better enabling computers and people to work in cooperation.” The semantic grid focuses on the systematic adoption of metadata and ontologies to describe grid resources and to enhance and automate service discovery and negotiation, application composition, information extraction, and knowledge discovery [40]. Knowledge grids [41], in turn, offer high-level tools and techniques for the distributed mining and extraction of knowledge from data repositories available on the grid, leveraging the semantic descriptions of components and data provided by the semantic grid and offering knowledge discovery services. We see as our main future research challenge in the ACGT the development of an infrastructure that is able to produce, use, and deploy knowledge as a basic element of advanced applications and that will essentially constitute a biomedical knowledge grid. To achieve this objective, metadata are critical. We use OWL-S [42] for developing metadata and ontologies for describing grid services so that they can be discovered, explained, composed, and executed automatically.
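What discovery and composition from service metadata might look like can be sketched in a few lines. The code below does not use OWL-S syntax or any OWL-S tooling; `ServiceDescription`, `discover`, `compose`, and the service and type names are hypothetical. It assumes that every service is described by the task it performs and by the type of its single input and output, and that two services can be chained when the output type of one equals the input type of the next.

```python
from collections import deque
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ServiceDescription:
    """Minimal, made-up metadata record for one analytical service."""
    name: str
    task: str
    input_type: str
    output_type: str


def discover(registry: List[ServiceDescription], task: str) -> List[ServiceDescription]:
    """Discovery: find services by the task their metadata declares."""
    return [svc for svc in registry if svc.task == task]


def compose(registry: List[ServiceDescription], start_type: str,
            goal_type: str) -> Optional[List[ServiceDescription]]:
    """Composition: breadth-first search for the shortest chain of services
    that turns data of start_type into data of goal_type."""
    queue = deque([(start_type, [])])
    seen = {start_type}
    while queue:
        current, chain = queue.popleft()
        if current == goal_type:
            return chain
        for svc in registry:
            if svc.input_type == current and svc.output_type not in seen:
                seen.add(svc.output_type)
                queue.append((svc.output_type, chain + [svc]))
    return None  # no chain of registered services connects the two types


services = [
    ServiceDescription("normalize", "normalization",
                       "RawExpressionMatrix", "NormalizedExpressionMatrix"),
    ServiceDescription("select_genes", "feature/gene selection",
                       "NormalizedExpressionMatrix", "GeneSignature"),
    ServiceDescription("fit_survival", "survival analysis",
                       "GeneSignature", "SurvivalModel"),
]

workflow = compose(services, "RawExpressionMatrix", "SurvivalModel")
workflow_names = [svc.name for svc in workflow] if workflow else []
```

In a semantically aware setting the equality test on type names would be replaced by subsumption reasoning over an ontology, so that a service accepting a general data type also matches more specific ones.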
Our initial investigations have also revealed the need for a sophisticated model of provenance, since the use of elementary workflows as well as advanced workflows, i.e., workflows containing other workflows, is becoming a very important goal in our R&D work. This requirement also calls for complex metadata about workflows to be maintained in the ACGT grid middleware.
A. Data, Service, and Workflow Metadata
Integration of applications and services requires substantial metainformation on algorithms and input/output formats if tools are supposed to interoperate. Furthermore, assembly of tools into complex discovery workflows will only be possible if data formats are compatible and the semantic relationships between objects shared or transferred in workflows are clear.
1) Metadata on Data Sources: In a “grid-enabled” data sharing VO, datasets may not be well known among all participants of the VO. To integrate the highly fragmented and isolated data sources, we need semantics to answer higher-level questions. Therefore, it becomes critically important to describe the context in which the data were captured. We describe this contextualization of the data as “metadata” (data about data). Semantic
integration, therefore, in the ACGT relies on metadata publishing and ontologies. A similar approach is reported by other initiatives. The myGrid project relies heavily on the use of semantic descriptions to allow more precise searching by people and machines as well as by workflows [43]. Also, associating appropriate metadata to the grid services has been vitally important to the caGrid, which is the grid infrastructure of caBIG. This is realized by adding contextual metadata around service data elements (SDEs), allowing the realization of the discovery use case (see cabig.nci.nih.gov/caBIG/overview/). Our initial analysis, in the context of the ACGT operational framework, also reveals the need for such rich metadata descriptors. Examples of such metadata are the research groups participating in a CT and publishing the data sets, the data types that are being exposed, the analytical tools that are published, the input data format required by these tools and the output data produced, and so forth. Some types of metadata that have been identified so far are as follows.
1) Contact Info: Contact info and other administrative data about a site participating in a CT, which shares information on the grid.
2) Data Type: The data type that a site is exposing and the context in which these data were generated.
3) Data Collection Method: This would include the name of the technique or the platform that was used to perform the analysis (e.g., Affymetrix), its model, software version, etc.
4) Ontological Category: An ontological category describes a particular concept that the dataset exposes or a tool operates upon.
2) Analytical Services Metadata: Similarly, the identified analytical services’ metadata descriptions fall into the following categories:
1) task performed by the service, i.e., the type of data analysis process (e.g., feature/gene selection, sample/patient categorization, survival analysis);
2) steps composing the task and the order in which the steps should be executed;
3) method used to perform an analytical/bioinformatics task;
4) algorithm implemented by the service;
5) input data on which the service works;
6) kind of output produced by the service.
The ultimate challenge is to achieve the implementation of semantically aware grid services. In achieving this objective, a service ontology is being developed to provide a single point of reference for these concepts and to support description logic reasoning of concept expressions.
IX. STATUS AND OUTLOOK
In this paper, we consider a world where biomedical software modules and data can be detected and composed to define problem-dependent applications. We wish to provide an environment allowing biomedical researchers to search and compose bioinformatics and other analytical software tools for solving biomedical problems. We focus on semantic modeling of the goals and requirements of such applications using ontologies.
The infrastructure that has been developed uses a common set of services and service registrations for the entire community working on CTs on cancer. The shared ACGT semantic services provide biomedical ontologies in common use across clinical trials and cancer research. The project is in its initial phases of implementation, with more than three years remaining. We are currently focusing on the development of the core set of components up to a stage where they can effectively support in silico investigation; initial prototypes have been useful in crystallizing requirements for semantics within e-Science. The selected demonstrators, together with the core set of components described, will enable us both to begin evaluation and to gather additional and more concrete requirements from our users. These will allow us to improve and refine the initial architecture and its services. In addition, the next phase of our work focuses on knowledge discovery in clinico-genomic data, which poses challenges for workflow management and execution. We propose the combination of several grid techniques in order to provide an interactive, easy-to-use, yet expressive environment that supports the needs of domain experts, i.e., clinicians and biologists. These techniques will be further extended and realized in the course of the project, taking into account the prior art in the relevant fields of workflow management and enactment and the semantic composition of services while following the user requirements elicitation process.
ACKNOWLEDGMENT
The authors would like to thank all members of the ACGT Consortium who are actively contributing to addressing the R&D challenges faced.
REFERENCES
[1] K. Yurkewicz. (2006, Jun. 21). Accelerating cancer research. Sci. Grid. [Online]. Available: http://www.interactions.org/sgtw/2006/0621/cabig_more.html
[2] K. H. Buetow, “Cyberinfrastructure: Empowering a “third way” in biomedical research,” Sci. Mag., vol. 308, no. 5723, pp. 821–824, 2005.
[3] M. Tsiknakis, D. Kafetzopoulos, G. Potamias, C. Marias, A. Analyti, and A. Manganas, “Developing European biomedical grid on cancer: The ACGT integrated project,” in Proc. HealthGrid 2006 Conf., Valencia, Spain, Stud Health Technol. Inf., vol. 120, pp. 247–58, Jun. 6–8.
[4] M. Cannataro, C. Comito, F. L. Schiavo, and P. Veltri, “Proteus, a grid based problem solving environment for bioinformatics: Architecture and experiments,” IEEE Comput. Intell. Bull., vol. 3, no. 1, pp. 7–18, Feb. 2004.
[5] K. Lai and M. Klapa, “Alternative pathways of galactose assimilation: Could inverse metabolic engineering provide an alternative to galactosemic patients?,” Metab. Eng., vol. 6, pp. 239–244, 2004.
[6] M. Klapa and J. Quackenbush, “The quest of the mechanisms of life,” Biotechnol. Bioeng., vol. 84, pp. 739–742, 2003.
[7] T. Sørlie, R. Tibshirani, J. Parker, T. Hastie, J. S. Marron, A. Nobel, S. Deng, H. Johnsen, R. Pesich, S. Geisler, J. Demeter, C. M. Perou, P. E. Lønning, P. O. Brown, A. Børresen-Dale, and D. Botstein, “Repeated observation of breast tumor subtypes in independent gene expression data sets,” Proc. Natl. Acad. Sci. USA, vol. 100, no. 14, pp. 8418–23, Jul. 2003.
[8] C. Sotiriou, S. Y. Neo, L. M. McShane, E. L. Korn, P. M. Long, A. Jazaeri, P. Martiat, S. B. Fox, A. L. Harris, and E. T. Liu, “Breast cancer classification and prognosis based on gene expression profiles from a population-based study,” Proc. Natl. Acad. Sci. USA, vol. 100, no. 18, pp. 10393–10398, Sep. 2003.
[9] Z. Hu, C. Fan, D. S. Oh, J. S. Marron, X. He, B. F. Qaqish, C. Livasy, L. A. Carey, E. Reynolds, L. Dressler, A. Nobel, J. Parker, M. G. Ewend, L. R. Sawyer, J. Wu, Y. Liu, R. Nanda, M. Tretiakova, A. Ruiz Orrico, D. Dreher, J. P. Palazzo, L. Perreard, E. Nelson, M. Mone, H. Hansen, M. Mullins, J. F. Quackenbush, M. J. Ellis, O. I. Olopade, P. S. Bernard, and C. M. Perou, “The molecular portraits of breast tumors are conserved across microarray platforms,” BMC Genomics, vol. 27, no. 7, p. 96, Apr. 2006.
[10] P. Farmer, H. Bonnefoi, V. Becette, M. Tubiana-Hulin, P. Fumoleau, D. Larsimont, G. MacGrogan, J. Bergh, D. Cameron, D. Goldstein, S. Duss, A. L. Nicoulaz, M. Fiche, C. Brisken, M. Delorenzi, and R. Iggo, “Identification of molecular apocrine breast tumours by microarray analysis,” Oncogene, vol. 24, no. 29, pp. 4660–4671, Jul. 2005.
[11] L. J. van’t Veer, H. Y. Dai, M. J. van de Vijver, Y. D. D. He, A. A. M. Hart, M. Mao, H. L. Peterse, K. van der Kooy, M. J. Marton, A. T. Witteveen, G. J. Schreiber, R. M. Kerkhoven, C. Roberts, P. S. Linsley, R. Bernards, and S. H. Friend, “Gene expression profiling predicts clinical outcome of breast cancer,” Nature, vol. 415, no. 6871, pp. 530–536, Jan. 2002.
[12] Y. Wang, J. G. Klijn, Y. Zhang, Y. Zhang, A. M. Sieuwerts, M. P. Look, F. Yang, D. Talantov, and J. A. Foekens, “Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer,” Lancet, vol. 365, no. 9460, pp. 671–679, Feb. 2005.
[13] J. A. Foekens, D. Atkins, Y. Zhang, F. C. Sweep, N. Harbeck, A. Paradiso, T. Cufer, A. M. Sieuwerts, D. Talantov, P. N. Span, V. C. Tjan-Heijnen, A. F. Zito, K. Specht, H. Hoefler, R. Golouh, F. Schittulli, M. Schmitt, L. V. Beex, J. G. Klijn, and Y. Wang, “Multicenter validation of a gene expression-based prognostic signature in lymph node-negative primary breast cancer,” J. Clin. Oncol., vol. 10, no. 24, pp. 1665–1671, Apr. 2006.
[14] C. Fan, D. S. Oh, L. Wessels, B. Weigelt, D. S.
Nuyten, A. B. Nobel, L. J. van’t Veer, and C. M. Perou, “Concordance among gene-expression-based predictors for breast cancer,” N. Engl. J. Med., vol. 355, pp. 560–569, 2006.
[15] J. A. O’Shaughnessy, “Molecular signatures predict outcome of breast cancer,” N. Engl. J. Med., vol. 355, pp. 615–617, 2006.
[16] S. Paik, S. Shak, G. Tang, C. Kim, J. Baker, J. M. Cronin, F. L. Baehner, M. G. Walker, D. Watson, T. Park, W. Hiller, E. R. Fisher, D. L. Wickerham, J. Bryant, and N. Wolmark, “A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer,” N. Engl. J. Med., vol. 351, pp. 2817–2826, 2006.
[17] P. Dönnes, A. Höglund, M. Sturm, N. Comtesse, C. Backes, E. Meese, O. Kohlbacher, and H. P. Lenhof, “Integrative analysis of cancer-related data using CAP,” FASEB J., vol. 18, pp. 1465–1467, 2004.
[18] G. S. Stamatakos, D. D. Dionysiou, E. I. Zacharaki, N. A. Mouravliansky, K. Nikita, and N. Uzunoglu, “In silico radiation oncology: Combining novel simulation algorithms with current visualization techniques,” Proc. IEEE, vol. 90, no. 11, pp. 1764–1777, Nov. 2002.
[19] M. Shaw and D. Garlan, Software Architecture: Perspectives on an Emerging Discipline. Englewood Cliffs, NJ: Prentice-Hall, 1996.
[20] W3C working draft. (2003, May 14). Web services architecture. [Online]. Available: http://www.w3.org/TR/2003/WD-ws-arch-20030514/
[21] I. Foster, “Globus toolkit version 4: Software for service-oriented systems,” in Proc. IFIP Int. Conf. Netw. Parallel Comput. (Lecture Notes in Computer Science 3779), Springer-Verlag, 2006, pp. 2–13.
[22] C. F. Taylor, “A systematic approach to modelling capturing and disseminating proteomics experimental data,” Nat. Biotechnol., vol. 21, pp. 247–254, 2003.
[23] Globus Toolkit 4.0: Security. Available: http://www.globus.org/toolkit/docs/4.0/security/. Accessed Nov. 2007.
[24] I. Foster, “Service oriented science,” Science, vol. 308, no. 5723, pp. 814–817, May 2005.
[25] I. Foster and S.
Tuecke, “Describing the elephant: The different faces of IT as service,” ACM Queue, vol. 3, no. 6, pp. 26–29, 2005.
[26] The Gridge Toolkit. Available: http://www.gridge.org. Accessed Nov. 2007.
[27] The OGSA-DAI project. Available: http://www.ogsadai.org.uk/. Accessed Nov. 2007.
[28] M. P. Said and I. Kojima, “OGSA-Webdb: An OGSA-based system for bringing web databases into the grid,” J. Digit. Inf. Manag., vol. 2, no. 2, pp. 48–53, Jun. 2004.
[29] R. Studer, V. R. Benjamins, and D. Fensel, “Knowledge engineering: Principles and methods,” IEEE Digit. Inf. Manag., vol. 25, no. 1/2, pp. 161–197, Mar. 1998.
[30] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, “From data mining to knowledge discovery: An overview,” in Advances in Knowledge Discovery and Data Mining, U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Eds. Menlo Park, CA: AAAI Press, 1996, pp. 1–30.
[31] B. Smith, “Ontology,” in Blackwell Guide to the Philosophy of Computing and Information, L. Floridi, Ed. Oxford, U.K.: Blackwell, 2003, pp. 155–166.
[32] Foundational Model of Anatomy (FMA). Available: http://sig.biostr.washington.edu/projects/fm/AboutFM.html. Accessed Nov. 2007.
[33] The Gene Ontology (GO). Available: http://geneontology.org/. Accessed Nov. 2007.
[34] C. Rosse, A. Kumar, J. L. V. Mejino, Jr., D. L. Cook, L. Detwiler, and B. Smith, “A strategy for improving and integrating biomedical ontologies,” in Proc. AMIA Symp. 2005, Washington, DC, pp. 639–643.
[35] A. Kumar, Y. L. Yip, B. Smith, D. Marwede, and D. Novotny, “An ontology for carcinoma classification for clinical bioinformatics,” in Proc. MIE 2005, Geneva, Switzerland, pp. 635–640.
[36] P. Grenon, B. Smith, and L. Goldberg, “Biodynamic ontology: Applying BFO in the biomedical domain,” in Ontologies Med., D. M. Pisanelli, Ed. Amsterdam, The Netherlands: IOS Press, 2004, pp. 20–38.
[37] N. Zhong and J. Liu, Eds., Intelligent Technologies for Information Analysis. New York: Springer-Verlag, 2004, pp. 615–659.
[38] T. R. Gruber, “A translation approach to portable ontologies,” Knowl. Acquisition, vol. 5, no. 2, pp. 199–220, 1993.
[39] J. Kohler, S. Philippi, and M. Lange, “Semeda: Ontology based semantic integration of biological databases,” Bioinformatics, vol. 19, no. 18, pp. 2420–2427, Dec. 2003.
[40] D. De Roure, N. R. Jennings, and N. Shadbolt, “The semantic grid: A future e-science infrastructure,” in Grid Computing: Making the Global Infrastructure a Reality, F. Berman, A. J. G. Hey, and G. E Fox, Eds. Hoboken, NJ: Wiley, 2003, pp. 437–470.
[41] M. Cannataro and D. Talia, “Knowledge grid: An architecture for distributed knowledge discovery,” Commun. ACM, vol. 46, no. 1, pp. 89–93, 2003.
[42] Web Service Ontology. Available: http://www.ai.sri.com/daml/services/owl-s/. Accessed Nov. 2007.
[43] C. Wroe, R. Stevens, C. Goble, A. Roberts, and M. Greenwood, “A suite of DAML+OIL ontologies to describe bioinformatics web services and data,” Int. J. Coop. Inf. Syst., vol. 12, no. 2, pp. 197–224, 2003.
Manolis Tsiknakis (M’92) received the B.Eng. degree in electronic engineering, M.Sc. degree in microprocessor engineering, and Ph.D. degree in control systems engineering from the University of Bradford, Bradford, U.K., in 1983, 1985, and 1989, respectively. In 1992, he joined the Foundation for Research and Technology–Hellas (FORTH), Heraklion, Greece, where he is currently a Principal Researcher in many collaborative R&D projects, the Head of the Center of eHealth Technologies, and also the Scientific Coordinator of the Advancing Clinico-Genomic Trials on Cancer (ACGT) integrated project. He is the initiator and Chair of the European Research Consortium for Informatics and Mathematics (ERCIM) Biomedical Informatics Working Group. His current research interests include biomedical informatics, component-based software engineering, information integration, ambient intelligence in eHealth and mHealth service platforms, and signal processing and analysis. Dr. Tsiknakis was a member of the Programme Committee of the HealthGRID conferences from 2003 to 2005 and of the IEEE Computer-Based Medical Systems (CBMS) 2005 International Conference. He is also a member of the Management Board of HL7-Hellas and the Association for Computing Machinery.
Mathias Brochhausen received the M.A. degree in philosophy, physical anthropology, and sociology and the Ph.D. degree in philosophy from the University of Mainz, Mainz, Germany, in 2000 and 2004, respectively. He is currently a Researcher at the Institute of Formal Ontology and Medical Information Science (IFOMIS), Saarbrücken, Germany. His current research interests include philosophy of biology and medicine, biomedical ontologies, and ontologies in paleoanthropology. He also teaches philosophy and theory of medicine at Saarland University, Saarbrücken, Germany.
Jarek Nabrzyski received the M.Sc. and Ph.D. degrees in computer science from the Poznan University of Technology, Poznan, Poland, in 1993 and 2000, respectively. He is currently a Researcher at the Poznan Supercomputing and Networking Center (PSNC), Poznan, Poland, where he heads the Applications Department. His current research interests include knowledge-based multiobjective project scheduling and resource management for parallel and distributed computing. From 2002 to 2005, he managed the European GridLab project, in which he was one of the Principal Investigators, responsible for areas such as resource management, security, and mobile user support. He is also involved in a number of other projects, including Advancing Clinico-Genomic Trials on Cancer (ACGT), InteliGrid, GridCoord, Bescherming Rechten Entertainment Industrie Nederland (BREIN), QosCosGrid, Challengers, BeInGrid, and Open Middleware Infrastructure Institute (OMII)-Europe. Dr. Nabrzyski is a member of several advisory boards, including those of projects such as Akogrimo, CoreGrid, and Ubiquitous Computing and Monitoring System (UCoMS) (USA). He is a co-founder of the European Grid Forum and the Global Grid Forum. He is also a member of the Korea Institute of Science and Technology (KISTI) Supercomputing Center Advisory Board (Korea).
Juliusz Pukacki received the M.Sc. degree in computer science (parallel and distributed computing) from the Poznan University of Technology, Poznan, Poland, in 1998. Since 1998, he has been working with the Poznan Supercomputing and Networking Center, Poznan, Poland. He was first employed in the Network Services Department as a Programmer of Internet and Intranet Services. He then joined the Applications Department and started to work in the grid computing area. He is currently a member of Jarek Nabrzyski's team, where he works on solutions for resource management in the grid environment. He has been involved in a number of grid projects. One of the most important was the GridLab Project, where he was the leader of the work package responsible for resource management. He still leads the development of the Grid Resource Management System (GRMS), a metascheduler for grid environments. His other projects include the Advancing Clinico-Genomic Trials on Cancer (ACGT), Inteligrid, and high performance computing (HPC)-Europa, as well as the national projects Progress, Silicon Graphics Image (SGI) Grid, and Clusterix.
Stelios G. Sfakianakis was born in Heraklion, Greece, in 1974. He received the B.Sc. degree in computer science and the M.Sc. degree (with highest distinction) in advanced information systems from the University of Athens, Athens, Greece, in 1995 and 1998, respectively. Since January 2000, he has been with the Foundation for Research and Technology–Hellas, Institute of Computer Science and Institute of Molecular Biology and Biotechnology, Heraklion, Greece. His current research interests include the design and implementation of a component-based architecture for the Integrated Electronic Health Record using the Common Object Request Broker Architecture (CORBA) and Web Services as the middleware technologies, the design of relational and hierarchical databases, visual modeling with the Unified Modeling Language (UML), open source operating systems, and distributed object and service-oriented computing.
George Potamias received the B.Sc. degree in mathematics and the Ph.D. degree in artificial intelligence from the University of Patras, Patras, Greece, in 1983 and 1991, respectively. In 1992, he joined the Foundation for Research and Technology–Hellas, Institute of Computer Science (FORTH-ICS), Heraklion, Greece, where he is currently a Principal Researcher leading the data-mining and knowledge discovery activity of the biomedical informatics laboratory. He has contributed to several European R&D projects, and is involved in the INFOBIOMED FP6-IST NoE Project on biomedical informatics and the LOCCANDIA FP6-IST STREP Project on lab-on-chip nanotechnologies for diagnostics. His current research interests include the development and customization of data-mining algorithms, tools, and systems, and their utilization in the biomedical domain.
Christine Desmedt received the Bio-Eng. degree in cells and genes biotechnology from the Catholic University of Leuven, Leuven, Belgium, in 2000, and the M.S. degree in biomedical sciences from the Free University of Brussels, Brussels, Belgium, in 2004. Since 2000, she has been working with the Jules Bordet Institute, Brussels, Belgium, an autonomous comprehensive cancer center devoted entirely to the fight against cancer. She was a Clinical Monitor for the Breast European Adjuvant Studies Group (Br.E.A.S.T.), co-coordinating the monitoring activities of external groups for the conduct of breast cancer trials. In 2002, she joined the Functional Genomics and Translational Research Unit, Jules Bordet Institute, where she is currently coordinating different research programs. Her current research interests include the identification and validation of prognostic and predictive markers in breast cancer, as well as a better characterization of breast cancer development and metastasis.
Dimitris Kafetzopoulos received the Ph.D. degree in applied biology and biotechnology from the Department of Biology, University of Crete, Crete, Greece. He studied biology at the University of Thessaloniki, Thessaloniki, Greece, and biochemistry at the Graduate School of the University of Toronto, Toronto, ON, Canada. During his postdoctoral research as a European Molecular Biology Organization (EMBO) and Human Frontier Science Program Organization (HFSPO) Fellow at the University of Leiden, Leiden, The Netherlands, he studied molecular signaling in plant–microbe interactions. Since 1997, he has been a Researcher at the Institute of Molecular Biology and Biotechnology (IMBB), Foundation for Research and Technology–Hellas (FORTH), Heraklion, Greece, leading the research group of postgenomic applications. His current research interests include drug development methodologies, molecular classification using DNA microarrays, and multianalyte approaches in genotyping. Dr. Kafetzopoulos was the recipient of a prize for his inventive and innovative research from the Greek Patent Office in 2002.