ProGenGrid: a Grid-enabled platform for Bioinformatics

G. Aloisio, M. Cafaro, S. Fiore, M. Mirto
CACT/ISUFI and SPACI Consortium, University of Lecce, Italy

HealthGrid 2005, 7th-9th April, Oxford

OUTLINE
• Bioinformatics: some issues
• Why Bioinformatics Grid?
• The Proteomics and Genomics Grid (ProGenGrid) project: a Grid framework for Bioinformatics
• Data management services
• Conclusions and future work

Bioinformatics Issues
• Large amounts of data and many applications;
• High heterogeneity:
  − Different types, forms, algorithms, implementations, communities, service providers;
• High complexity and inter-relations;
• Exploitation of large computing power for supporting “in silico” experiments.

Why Bioinformatics Grid?
• Deployment, distribution and management system of the needed software components;
• Harmonized, standard integration of various software layers and services;
• Powerful, flexible policy definition, control and negotiation mechanism for a collaborative grid environment;
• The Life Science Grid Research Group, established under the Global Grid Forum, underlines how a Grid framework (offering services and standards) satisfies bioinformatics requirements.

ProGenGrid Project
The aim of the ProGenGrid project is the creation of a distributed and ubiquitous grid environment for supporting “in silico” experiments in bioinformatics. Using such an environment, which can be considered a virtual laboratory, e-scientists will access:
• analysis tools (e.g. EMBOSS, Blast),
• biological databases (e.g. GenBank, Protein Data Bank),
• visualization tools (e.g. Rasmol).

These tools will be available as Web/Grid Services according to a Service Oriented Architecture and accessible through a Web Portal.

Service Oriented Architecture
• Web Service
  − XML, SOAP, WSDL, UDDI
• Web/Grid Service
  − Grid WSDL
  − OGSA & WSRF (Open Grid Services Architecture & Web Services Resource Framework)

[Figure: Service Consumer, Service description and UDDI Registry. Arrow labels: “Search Service”, “Redirects to service description”, “SOAP XML based Messaging”.]

Allows building enhanced services independently of platform, programming language, tools, and network infrastructure.
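The search, describe and invoke interaction between a consumer and a registry can be illustrated with a minimal in-memory sketch. It is only an analogy: the `Registry` class, its method names and the `reverse_complement` service are hypothetical, and no real SOAP, WSDL or UDDI calls are made.

```python
class Registry:
    """Toy stand-in for a UDDI-style registry (hypothetical, in-memory only)."""

    def __init__(self):
        self._entries = {}

    def publish(self, name, description, endpoint):
        self._entries[name] = {"description": description, "endpoint": endpoint}

    def search(self, keyword):
        """Return the names of services whose description mentions the keyword."""
        keyword = keyword.lower()
        return [
            name
            for name, entry in self._entries.items()
            if keyword in entry["description"].lower()
        ]

    def describe(self, name):
        return self._entries[name]


def reverse_complement(sequence):
    """Example service body; a real service would sit behind a network endpoint."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in reversed(sequence))


registry = Registry()
registry.publish("revcomp", "Reverse complement of a DNA sequence", reverse_complement)

# The consumer searches the registry, reads the description, then invokes the service.
name = registry.search("dna")[0]
service = registry.describe(name)["endpoint"]
print(service("ATGC"))
```

The consumer depends only on the registry entry, not on how or where the service is implemented, which is the independence property stated above.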

Services-layered Architecture
[Figure: four-level architecture; the diagram carries a “Main Focus” callout.]
• Level 4: WorkFlow (Application)
• Level 3: Semantic Grid Services (XML, RDF, RDF Schema, Ontology) and Data Grid Services (GridFTP, SRM, DAI)
• Level 2: Generic Grid Services (GRAM, GSI, MDS)
• Level 1: Genome database, Protein database, Disease database, Clinical database, …

Level 1: legacy data sources
• Several data sources
• Heterogeneity of data sources
• Poor level of integration
• Legacy catalogue

Framework for supporting bioinformatics research.

Level 2: Generic Grid Services
• Job submission: GRAM
• Security: GSI
• Information Service: MDS, iGrid
• Efficient Data Transport: GridFTP

Level 3: Semantic Grid Services
Ontologies provide additional information that bridges the syntactic and semantic gaps between the individual data sources and the user.
• Several formats are connected with the ontologies:
  − XML
  − RDF
  − …
• This level provides services supporting data integration.

Ontology
An ontology defines a common vocabulary for the information in a specific domain. It includes definitions of the basic concepts in the domain and of the relations among them, which should be interpretable by both machines and humans.
• Ontologies are used at two levels:
  − Workflow validation: during the composition of tasks, without knowing application details, and conversion of input data if needed.
  − Data access:
    − semantic integration of different data sources;
    − analysis of stored data coming from different experiments.

Ontology of software for ProGenGrid WF
The ProGenGrid software components are classified into data banks, bioinformatics algorithms, graphics tools, drug design tools, and input data types. This first ontology, written in DAML+OIL, has been stored in a relational database.

[Figure: entity-relationship diagram of the ontology database. Entities: INPUT TYPE, CLASS, WORKFLOW, ACTIVITY INSTANCE. Relationships: accept, father, belongs, composed by, child. Attributes shown: role; type, display; id_class, name, description, type, filename; conditiontype, conditionvalue; id_workflow, name, description, filename; id_activity, description.]

Advantages of using an ontology
• To keep track of:
  − the input data that a given component can receive;
  − the relations between the input and output data of the components, in order to determine rules and establish the correct flow of data.
• To associate a description with the logical name of each activity.
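How recorded input and output types can support workflow validation is sketched below. This is an illustration under stated assumptions, not the ProGenGrid implementation: the `Component` record, the catalogue entries and the type labels are all hypothetical, and the check covers only a linear chain of components.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """One catalogue entry: a tool, the input types it accepts, and its output type."""
    name: str
    accepts: frozenset
    produces: str


# Hypothetical catalogue; the type labels are illustrative only.
CATALOGUE = {
    "blast": Component("blast", frozenset({"fasta"}), "alignment"),
    "viewer": Component("viewer", frozenset({"pdb"}), "image"),
    "align_report": Component("align_report", frozenset({"alignment"}), "report"),
}


def validate_chain(names, catalogue=CATALOGUE):
    """Return one error string per incompatible pair of consecutive components."""
    errors = []
    for upstream, downstream in zip(names, names[1:]):
        src, dst = catalogue[upstream], catalogue[downstream]
        if src.produces not in dst.accepts:
            errors.append(
                f"{src.name} produces '{src.produces}', "
                f"but {dst.name} accepts {sorted(dst.accepts)}"
            )
    return errors


print(validate_chain(["blast", "align_report"]))  # compatible chain
print(validate_chain(["blast", "viewer"]))        # incompatible chain
```

A mismatch found this way could either be reported to the user or trigger a conversion of the input data, as mentioned in the ontology slide.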

Level 3: Data Grid Services
One of the main goals of grid technology is to provide efficient access to data.
Our scenario involves:
• many distributed and heterogeneous data sources;
• huge amounts of data;
• intensive computations.
Bioinformatics needs efficient data grid services for data integration.

Data Grid Services

Main Focus

Data Integration
• Accessing data from a legacy system may be difficult for several reasons. The system may:
  − have been developed for a different hardware or software platform;
  − use a different data model;
  − use a different DBMS;
  − use different data definitions;
  − use a different data format.
• All of these make it difficult to integrate and share data.

Data Integration
[Figure: data integration architecture. Components shown, from top to bottom: Client/User (RDF); Mediator, with Mediator Engine, Data Source Ontology, Information Integrator and Mapper (RDF Schema, Ontology); Web Services; Wrappers; data sources (Flat File, Relational DB, XML).]

Standard Database Access Interface 2.0
[Figure: Std Data source Access Interface]

Features:
• Standard access to data sources
• Plug-in architecture based on dynamic libraries
• Wrapper extensions for bioinformatics data sources
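The idea of wrappers that hide each data source behind one standard interface can be sketched as follows. This is a simplified illustration, not the project's interface: the class names, the `records()` method and the sample entries are hypothetical, and the "mediator" here merely concatenates records instead of performing ontology-based integration.

```python
import csv
import io
import sqlite3
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod


class Wrapper(ABC):
    """Common access interface: every source yields records as plain dicts."""

    @abstractmethod
    def records(self):
        ...


class FlatFileWrapper(Wrapper):
    def __init__(self, text):
        self.text = text  # tab-separated text with a header row

    def records(self):
        return [dict(row) for row in csv.DictReader(io.StringIO(self.text), delimiter="\t")]


class RelationalWrapper(Wrapper):
    def __init__(self, connection):
        self.connection = connection

    def records(self):
        cursor = self.connection.execute("SELECT id, name FROM protein")
        return [{"id": str(row[0]), "name": row[1]} for row in cursor]


class XmlWrapper(Wrapper):
    def __init__(self, xml_text):
        self.root = ET.fromstring(xml_text)

    def records(self):
        return [
            {"id": entry.get("id"), "name": entry.findtext("name")}
            for entry in self.root.iter("entry")
        ]


def integrate(wrappers):
    """Toy mediator: concatenate the records of all sources."""
    merged = []
    for wrapper in wrappers:
        merged.extend(wrapper.records())
    return merged


# Three tiny in-memory sources with made-up entries.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE protein (id INTEGER, name TEXT)")
db.execute("INSERT INTO protein VALUES (2, 'beta')")
sources = [
    FlatFileWrapper("id\tname\n1\talpha\n"),
    RelationalWrapper(db),
    XmlWrapper("<entries><entry id='3'><name>gamma</name></entry></entries>"),
]
result = integrate(sources)
print(result)
```

Supporting a new kind of source then means adding one more `Wrapper` subclass, which mirrors the plug-in approach listed among the features.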

Level 4: WorkFlow Management System (WFMS)
The WorkFlow Management Coalition (WFMC) defines workflow as: “The automation of a business process, in whole or part, during which documents, information or tasks are passed from one participant to another for action, according to a set of procedural rules.”

WFMC
• founded in 1993; 24 countries, 170 members
• terminology, standard interfaces, promotion

Workflow phases
• Stage 1 - Component discovery: discovers the available bioinformatics tools, data banks and graphics tools, which are modeled through the ontology;
• Stage 2 - Workflow editing: the discovered components are made available to a semantic editor that allows the design of an experiment (Abstract Workflow); the activities are modeled using UML;
• Stage 3 - Execution plan: the abstract workflow is translated into an “execution plan” (Concrete Workflow) containing the order of the activities and the logical names of the resources (needed for their discovery in a Grid environment);
• Stage 4 - Application execution: the ProGenGrid scheduler schedules the concrete workflow on a computational grid;
• Stage 5 - Application monitoring: whenever workflow activities start or finish, the system visualizes the progress of the workflow execution using a graphical utility.
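Stage 3 can be pictured as ordering the activities of a dependency graph and attaching a logical resource name to each. The sketch below assumes the abstract workflow is a directed acyclic graph and uses a topological sort from the Python standard library; the activity names and resource names are hypothetical, and the actual ProGenGrid translation may work differently.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical abstract workflow: each activity maps to the activities it depends on.
abstract_workflow = {
    "fetch_sequence": set(),
    "search_similar": {"fetch_sequence"},
    "fetch_structure": {"fetch_sequence"},
    "render": {"search_similar", "fetch_structure"},
}

# Hypothetical logical resource names, to be resolved later by resource discovery.
logical_resource = {
    "fetch_sequence": "sequence-db",
    "search_similar": "similarity-search",
    "fetch_structure": "structure-db",
    "render": "viewer",
}


def execution_plan(workflow, resources):
    """Order the activities so that every dependency comes before its dependents."""
    order = TopologicalSorter(workflow).static_order()
    return [(activity, resources[activity]) for activity in order]


plan = execution_plan(abstract_workflow, logical_resource)
for step, (activity, resource) in enumerate(plan, start=1):
    print(step, activity, resource)
```

The two middle activities do not depend on each other, so a scheduler would be free to run them in parallel; the flat list only fixes one valid order.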

Workflow architecture
[Figure: workflow architecture. The ProGenGrid Editor discovers the available components (MOR) and validates the workflow (MOR); the result is the Abstract Workflow, which is translated into the Execution Plan (Concrete Workflow) and sent to the Enactment Service. The Enactment Service generates the WorkList Activities, which are transformed into Web Service invocations and executed on Grid Resources. The Resource Discovery & Selector queries the Resource Information Service to select resources.]

Workflow GUI
• Toolbar for inserting special states and fork, join and condition tasks.
• UML graph related to the current workflow.
• Classification of the activities available on a computational grid.

Graphical WorkFlow Monitoring
• Activity and workflow status, with the related application error messages, represented as activities.

UML
[Figure: UML class diagram of the workflow editor. The notes on the diagram are translated from Italian; class, attribute and method names are kept as in the source.]

workflowView
Abstract class that represents both the leaves and the composite element. It provides the interface and the default behavior of all the classes.
Methods: disegna(Graphics), contiene(Point2D), aggiungi(workflowView), elimina(workflowView), getBaricentro(), setBaricentro(Point2D), getIDUML(), getWidth(), getHeigth(), accettaCollegamenti(), riceveCollegamenti(), setText(String), getText(), setPrecedente(ID : String), setSuccessivo(ID : String), cancellaSuccessivo(ID : String), cancellaPrecedente(ID : String), getPuntoCollegamento(Line2D)

workflowUML (java.util.Observer; 1 workflowUML holds 0..n workflowView)
Concrete class used to contain and manage the graphical elements of the workflow.
Attributes: elementiGrafici[0..*] : workflowView, prossimoID
Methods: disegna(Graphics), contiene(Point2D), aggiungi(workflowView), elimina(workflowView), generaID(), getElementiGrafici(), collega(origine : workflowView, destinazione : workflowView), collegaCond(origine : conditionUML, destinazione : workflowView, espressione : String), getAttivitaUML(IDModello : String), getOggettoUML(IDUML : String), addStartUML(baricentro : Point2D), addEndUML(baricentro : Point2D), addForkUML(baricentro : Point2D), addJoinUML(baricentro : Point2D), addConditionUML(baricentro : Point2D), disconnetti(origine : workflowView, destinazione : workflowView), getFrecciaUML(IDUMLOrigine : String, IDUMLArrivo : String), update(Observable o, Object arg)

attivitaUML
Attributes: larghezza, altezza, larghezzaArco, altezzaArco, rettangolo : RoundRectangle2D, baricentro : Point2D, IDUML, IDModelloAttivita, descrizione, IDOggettoUMLPrec, IDOggettoUMLSucc, tipoInputUtente, valoreInputUtente
Methods: disegna(Graphics), contiene(Point2D), getBaricentro(), setBaricentro(Point2D), getIDUML(), getIDModello(), getValoreInput(), setValoreInput(), getTipoInputUtente(), setTipoInputUtente(), getIDOggettoUMLSucc(), getIDOggettoUMLPrec(), getWidth(), getHeigth(), accettaCollegamenti(), riceveCollegamenti(), setText(String), getText(), setPrecedente(ID : String), setSuccessivo(ID : String), cancellaSuccessivo(ID : String), cancellaPrecedente(ID : String), getPuntoCollegamento(Line2D)

startUML
Attributes: diametro, baricentro : Point2D, cerchio : Ellipse2D, IDUML, IDOggettoUMLSucc, commento
Methods: disegna(Graphics), contiene(Point2D), getBaricentro(), setBaricentro(Point2D), getIDUML(), getWidth(), getHeigth(), accettaCollegamenti(), riceveCollegamenti(), setText(String), getText(), setSuccessivo(ID : String), cancellaSuccessivo(ID : String), getPuntoCollegamento(Line2D)

endUML
Attributes: diametroInt, diametroEst, baricentro : Point2D, cerchioInt : Ellipse2D, cerchioEst : Ellipse2D, IDUML, IDOggettoUMLPrec, commento
Methods: disegna(Graphics), contiene(Point2D), getBaricentro(), setBaricentro(Point2D), getIDUML(), getWidth(), getHeigth(), accettaCollegamenti(), riceveCollegamenti(), setText(String), getText(), setPrecedente(ID : String), cancellaPrecedente(ID : String), getIDOggettoUMLPrec(), getPuntoCollegamento(Line2D)

forkUML
Attributes: larghezza, altezza, baricentro : Point2D, rettangolo : Rectangle2D, IDUML, IDOggettoUMLPrec : String, IDOggettiUMLSucc[0..*] : String, commento
Methods: disegna(Graphics), contiene(Point2D), getBaricentro(), setBaricentro(Point2D), getIDUML(), getWidth(), getHeigth(), accettaCollegamenti(), riceveCollegamenti(), setText(String), getText(), setPrecedente(ID : String), setSuccessivo(ID : String), cancellaSuccessivo(ID : String), cancellaPrecedente(ID : String), getOggettiUMLSucc(), getIDOggettoUMLPrec(), getPuntoCollegamento(Line2D)

joinUML
Attributes: larghezza, altezza, baricentro : Point2D, rettangolo : Rectangle2D, IDUML, IDOggettiUMLPrec[0..*] : String, IDOggettoUMLSucc : String, commento
Methods: disegna(Graphics), contiene(Point2D), getBaricentro(), setBaricentro(Point2D), getIDUML(), getWidth(), getHeigth(), accettaCollegamenti(), riceveCollegamenti(), setText(String), getText(), setPrecedente(ID : String), setSuccessivo(ID : String), cancellaSuccessivo(ID : String), cancellaPrecedente(ID : String), getOggettoUMLSucc(), getIDOggettiUMLPrec(), getPuntoCollegamento(Line2D)

conditionUML
Attributes: larghezza, altezza, baricentro : Point2D, rombo : Polygon, IDUML, IDOggettoUMLPrec : String, IDOggettiUMLSucc[0..*] : String, commento
Methods: disegna(Graphics), contiene(Point2D), getBaricentro(), setBaricentro(Point2D), getIDUML(), getWidth(), getHeigth(), accettaCollegamenti(), riceveCollegamenti(), setText(String), getText(), setPrecedente(ID : String), setSuccessivo(ID : String), cancellaSuccessivo(ID : String), cancellaPrecedente(ID : String), getIDOggettoUMLPrec(), getIDOggettiUMLSucc(), getPuntoCollegamento(Line2D)

frecciaUML
Attributes: traiettoria : Line2D, effettiva : Line2D, WF : workflowUML, IDUML, IDOggettoUMLOrigine, IDOggettoUMLArrivo, commento
Methods: disegna(Graphics), contiene(Point2D), getIDUML(), setText(String), getText(), getIDOggettoUMLOrigine(), getIDOggettoUMLArrivo()

Drug Discovery: Development Life Cycle
[Figure: timeline of the development life cycle on an axis from 0 to 16 years.]
• Discovery (2 to 10 years)
• Preclinical Testing (lab and animal testing)
• Phase I (20-30 healthy volunteers, used to check for safety and dosage)
• Phase II (100-300 patient volunteers, used to check for efficacy and side effects)
• Phase III (1000-5000 patient volunteers, used to monitor reactions to long-term drug use)
• FDA Review & Approval
• Post-Marketing Testing

Overall: 7-15 years and $600-700 million!

Phases of DD
• Target Identification
  − What protein can we attack to stop the disease from progressing?
• Lead discovery & optimization
  − What sort of molecule will bind to this protein? (Molecular Docking)
• Toxicology
  − Side effects

Issues and Grid solutions for DD
• Screening of a large set of compounds
  − The old way: exhaustive screening
  − The new way: parallel screening on the Grid!
• Docking
  − The old way: execution of legacy software
  − The new way: large-scale docking, transforming existing software into parameter sweep applications for execution on distributed systems
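A parameter sweep runs the same program once per combination of inputs and parameter values, and the runs are independent of each other. The sketch below shows that shape only: `toy_score` is a made-up stand-in for a docking run, the ligand names and parameter values are invented, and a local thread pool stands in for grid nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def toy_score(ligand, grid_spacing):
    """Stand-in for one run of a docking program; NOT a real scoring function."""
    return round(len(ligand) * grid_spacing, 3)


def run_task(task):
    ligand, grid_spacing = task
    return ligand, grid_spacing, toy_score(ligand, grid_spacing)


# The sweep: every ligand is combined with every parameter value.
ligands = ["ligA", "ligBB", "ligCCC"]  # made-up identifiers
grid_spacings = [0.25, 0.5]            # made-up parameter values
tasks = list(product(ligands, grid_spacings))

# The tasks are independent, so they can be farmed out to separate workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_task, tasks))

best = min(results, key=lambda r: r[2])
print(len(results), "runs; lowest toy score:", best)
```

Because no task needs the result of another, the number of workers can grow with the number of available nodes without changing the code that defines the sweep.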

Split Service: General Purpose Schema
[Figure: ClientA, ClientB and ClientAB interact with the Split Service (Splitter Component, ACL) and with a Computational Engine placed “BEHIND THE WALL”. Labels: Splitter request (XML format), ID, Data & Query, Upload/download result, Available IDs request, Fragments ID; interaction steps numbered 1 to 3.]

Enhanced Split Service
Within the ProGenGrid project we have been developing an enhanced Split Service customized for bioinformatics applications. Customizations are related to:
• Computational engines:
  − Autodock
  − Dock (Sphgen, grid)
  − GAMESS
• Broker functionalities
• Workflow support
• High-level functionalities for end users
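One loose reading of a split service is: data is cut into fragments identified by IDs, a client asks which IDs are still available, downloads a fragment, runs a computational engine on it and uploads the result. The sketch below follows that reading only; the `SplitService` class, its methods and the `engine` function are hypothetical, and access control (the ACL) is left out.

```python
class SplitService:
    """Toy in-memory splitter: hands out fragments by ID and collects results."""

    def __init__(self, items, fragment_size):
        starts = range(0, len(items), fragment_size)
        self.fragments = {
            fragment_id: items[start:start + fragment_size]
            for fragment_id, start in enumerate(starts, start=1)
        }
        self.results = {}

    def available_ids(self):
        """IDs of the fragments that have no uploaded result yet."""
        return [fid for fid in self.fragments if fid not in self.results]

    def download(self, fragment_id):
        return list(self.fragments[fragment_id])

    def upload_result(self, fragment_id, result):
        self.results[fragment_id] = result

    def merged(self):
        """Results concatenated in fragment order."""
        return [value for fid in sorted(self.results) for value in self.results[fid]]


def engine(fragment):
    """Stand-in computational engine: returns the length of each item."""
    return [len(item) for item in fragment]


service = SplitService(["a", "bb", "ccc", "dddd", "eeeee"], fragment_size=2)
while service.available_ids():
    fid = service.available_ids()[0]
    service.upload_result(fid, engine(service.download(fid)))
print(service.merged())
```

Swapping `engine` for a call to a different program is the kind of customization point that the list of computational engines suggests.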

Conclusions and Future Work
ProGenGrid is a software platform that allows existing bioinformatics resources, wrapped as Web Services, to be composed into complex workflows. In particular, it offers:
• tools for service composition, workflow execution and monitoring;
• a data integration approach to simplify access to heterogeneous biological databases.

In the future…
Full implementation of the architecture and its evaluation against other approaches.

SPACI Project
Southern Partnership for Advanced Computational Infrastructures
A grid infrastructure based on three geographically distributed High Performance Computing centers located in Southern Italy:
• ISUFI/CACT, Center for Advanced Computing Technologies, University of Lecce. Director: Prof. Giovanni Aloisio
• CPS/CNR, Center for Research on Parallel Computing and Supercomputing (now Section of Naples of ICAR/CNR). Director: Prof. Almerico Murli
• MIUR/HPCC, Center of Excellence for High Performance Computing, University of Calabria. Director: Prof. Lucio Grandinetti

For any information…

About ProGenGrid Project
Project P. I.: Maria Mirto ([email protected])
Giovanni Aloisio ([email protected])
Massimo Cafaro ([email protected])
Sandro Fiore ([email protected])
WebSite: http://datadog.unile.it/progen