ProGenGrid: a Grid-enabled platform for Bioinformatics
G. Aloisio, M. Cafaro, S. Fiore, M. Mirto
CACT/ISUFI and SPACI Consortium, University of Lecce, Italy
HealthGrid 2005, 7th-9th April, Oxford
OUTLINE
• Bioinformatics: some issues
• Why Bioinformatics Grid?
• The Proteomics and Genomics Grid (ProGenGrid) project: a Grid framework for Bioinformatics
• Data management services
• Conclusions and future work
Bioinformatics Issues
• Large amounts of data and many applications;
• High heterogeneity: different types, forms, algorithms, implementations, communities, service providers;
• High complexity and inter-relations;
• Exploitation of large computing power for supporting “in silico” experiments.
Why Bioinformatics Grid?
• Deployment, distribution and management system of needed software components;
• Harmonized, standard integration of various software layers and services;
• Powerful, flexible policy definition, control and negotiation mechanism for a collaborative grid environment;
• The Life Science Grid Research Group, established under the Global Grid Forum, underlines that a Grid framework (offering services and standards) satisfies bioinformatics requirements.
ProGenGrid Project
The aim of the ProGenGrid project is the creation of a distributed and ubiquitous grid environment for supporting “in silico” experiments in bioinformatics. Using such an environment, which can be considered a virtual laboratory, e-scientists will access:
• analysis tools (e.g. EMBOSS, Blast),
• biological databases (e.g. GenBank, Protein Data Bank),
• visualization tools (e.g. Rasmol).
These tools will be available as Web/Grid Services according to a Service Oriented Architecture and accessible through a Web Portal.
Service Oriented Architecture
• Web Service
− XML, SOAP, WSDL, UDDI
• Web/Grid Service
− Grid WSDL
− OGSA & WSRF (Open Grid Services Architecture & Web Services Resource Framework)
[Figure: a Service Consumer searches the UDDI Registry, is redirected to the service description and to the service, and communicates with the service through SOAP (XML-based messaging).]
Allows building enhanced services independently of platform, programming language, tools, and network infrastructure.
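The consumer/registry/service interaction on this slide could be sketched as follows. This is only an in-process illustration: the `Registry` class stands in for UDDI, `ServiceDescription` for a WSDL document, and a plain Python callable for a SOAP endpoint. All names are hypothetical and none of them come from ProGenGrid.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ServiceDescription:
    """Hypothetical stand-in for a WSDL service description."""
    name: str
    category: str
    endpoint: Callable[[str], str]  # stands in for a SOAP endpoint


class Registry:
    """Toy registry playing the role of UDDI (assumption, not a real UDDI client)."""

    def __init__(self) -> None:
        self._services: Dict[str, ServiceDescription] = {}

    def publish(self, description: ServiceDescription) -> None:
        # The provider publishes its description under its name.
        self._services[description.name] = description

    def find(self, category: str) -> List[ServiceDescription]:
        # The consumer searches by category and gets descriptions back.
        return [s for s in self._services.values() if s.category == category]


def reverse_complement(sequence: str) -> str:
    """Example service body: reverse complement of a DNA string."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in reversed(sequence))


registry = Registry()
registry.publish(
    ServiceDescription("revcomp", "sequence-analysis", reverse_complement)
)

# Find the service through the registry, then bind to it and invoke it.
found = registry.find("sequence-analysis")
result = found[0].endpoint("ATGC")
print(result)
```

The consumer never refers to `reverse_complement` directly; it only knows the category it is looking for, which is the decoupling the slide attributes to the service-oriented approach.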
Services-layered Architecture
[Figure: four-level services architecture, with a “Main Focus” callout.]
• Level 4: WorkFlow
• Level 3: Application Semantic Grid Services (XML, RDF, RDF Schema, Ontology) and Data Grid Services (GridFTP, SRM, DAI, …)
• Level 2: Generic Grid Services (GRAM, GSI, MDS, …)
• Level 1: Genome database, Protein database, Disease database, Clinical database
Level 1: legacy data sources
• Several data sources
• Heterogeneity of data sources
• Poor level of integration
• Legacy catalogue
Framework for supporting bioinformatics research.
Level 2: Generic Grid Services
• Job submission: GRAM
• Security: GSI
• Information Service: MDS, iGrid
• Efficient Data Transport: GridFTP
Level 3: Semantic Grid Services
Additional information bridging the syntactic and semantic gaps between the individual data sources and the user is provided within the ontologies. Several formats connected with the ontologies are:
• XML
• RDF
• …
This level provides services supporting data integration.
Ontology
An ontology defines a common vocabulary for the information in a specific domain. It includes definitions of the basic concepts in the domain and the relations among them, which should be interpretable by both machines and humans.
• Use of ontologies at two levels:
− Workflow validation: checking the composition of tasks without knowing application details, and converting input data if needed.
− Data access: semantic integration of different data sources; analysis of stored data coming from different experiments.
Ontology of software for ProGenGrid WF
Classification of the ProGenGrid software components into data banks, bioinformatics algorithms, graphics tools, drug design tools, and input data types. This first ontology, written in DAML+OIL, has been stored in a relational database.
[Figure: entity-relationship diagram of the relational store. Entities: INPUT TYPE, CLASS, WORKFLOW, ACTIVITY INSTANCE. Relationships: accept, belongs, composed by, father/child, with cardinality labels 1, N and M. Attribute labels: role; type, display; id_class, name, description, type, filename; conditiontype, conditionvalue; id_workflow, name, description, filename; id_activity, description.]
Advantages of using an ontology
• To keep track of:
− the input data that a given component can receive;
− the relations between the input and output data of the components, for determining rules and establishing the correct flow of data.
• To associate a description with the logical name of the activities.
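As an illustration of how a relationally stored ontology could support workflow validation, the sketch below checks whether the output of one component is an input type accepted by another. The table layout (`component_class`, `input_type`, `accepts`), the `output_type` column and the component names are assumptions made for this example; they are not the ProGenGrid schema.

```python
import sqlite3

# In-memory database with a hypothetical, simplified ontology store.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE input_type (id_type INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE component_class (
        id_class INTEGER PRIMARY KEY,
        name TEXT UNIQUE,
        output_type TEXT
    );
    -- N:M relation: which input types each component class accepts.
    CREATE TABLE accepts (
        id_class INTEGER REFERENCES component_class(id_class),
        id_type INTEGER REFERENCES input_type(id_type)
    );
    INSERT INTO input_type VALUES
        (1, 'accession_id'), (2, 'fasta_sequence'), (3, 'pdb_structure');
    INSERT INTO component_class VALUES
        (1, 'sequence_fetch', 'fasta_sequence'),
        (2, 'similarity_search', 'alignment'),
        (3, 'structure_viewer', 'image');
    INSERT INTO accepts VALUES (1, 1), (2, 2), (3, 3);
    """
)


def can_connect(producer: str, consumer: str) -> bool:
    """Return True if the producer's output type is accepted by the consumer."""
    row = conn.execute(
        """
        SELECT COUNT(*)
        FROM component_class AS p
        JOIN input_type AS t ON t.name = p.output_type
        JOIN accepts AS a ON a.id_type = t.id_type
        JOIN component_class AS c ON c.id_class = a.id_class
        WHERE p.name = ? AND c.name = ?
        """,
        (producer, consumer),
    ).fetchone()
    return row[0] > 0


print(can_connect("sequence_fetch", "similarity_search"))
print(can_connect("sequence_fetch", "structure_viewer"))
```

An editor could run a check of this kind each time the user draws a link between two activities, without knowing anything about the applications behind them.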
Level 3: Data Grid Services
One of the main goals of grid technology is to provide efficient access to data.
Our scenario involves:
• Many distributed and heterogeneous data sources
• Huge amounts of data
• Intensive computations
Bioinformatics needs efficient data grid services for data integration.
Data Grid Services
Main Focus
Data Integration
Accessing data from a legacy system may be difficult for several reasons:
− Developed for a different hardware or software platform
− Uses a different data model
− Uses a different DBMS
− Uses different data definitions
− Uses a different data format
All of these make integrating and sharing data difficult.
Data Integration
[Figure: mediator-based data integration architecture. Components shown: Client/User, RDF, Mediator (Mediator Engine, Data Source Ontology, Mapper), Information Integrator Web Services, RDF Scheme, Ontology, and WRAPPERs in front of the data sources (Flat File, Relational DB, XML).]
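A minimal sketch of the mediator/wrapper idea, assuming three toy sources (a tab-separated flat file, a relational table and an XML document) that describe proteins with different field names. The wrapper classes, field names and sample data are all hypothetical; the point is only that each wrapper maps its source onto one common record shape, so the mediator can query them uniformly.

```python
import csv
import io
import sqlite3
import xml.etree.ElementTree as ET
from typing import Dict, Iterable, List, Optional, Protocol

Record = Dict[str, Optional[str]]


class Wrapper(Protocol):
    def query(self, protein_id: str) -> List[Record]: ...


class FlatFileWrapper:
    """Wraps tab-separated text with columns 'acc' and 'label'."""

    def __init__(self, text: str) -> None:
        self._rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))

    def query(self, protein_id: str) -> List[Record]:
        return [
            {"id": r["acc"], "name": r["label"], "source": "flat"}
            for r in self._rows
            if r["acc"] == protein_id
        ]


class RelationalWrapper:
    """Wraps a table 'proteins(pid, pname)'."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn

    def query(self, protein_id: str) -> List[Record]:
        rows = self._conn.execute(
            "SELECT pid, pname FROM proteins WHERE pid = ?", (protein_id,)
        )
        return [{"id": p, "name": n, "source": "relational"} for p, n in rows]


class XmlWrapper:
    """Wraps <entries><entry id="..."><name>...</name></entry></entries>."""

    def __init__(self, xml_text: str) -> None:
        self._root = ET.fromstring(xml_text)

    def query(self, protein_id: str) -> List[Record]:
        return [
            {"id": e.get("id"), "name": e.findtext("name"), "source": "xml"}
            for e in self._root.findall("entry")
            if e.get("id") == protein_id
        ]


class Mediator:
    """Sends the same query to every wrapper and merges the answers."""

    def __init__(self, wrappers: Iterable[Wrapper]) -> None:
        self._wrappers = list(wrappers)

    def query(self, protein_id: str) -> List[Record]:
        results: List[Record] = []
        for wrapper in self._wrappers:
            results.extend(wrapper.query(protein_id))
        return results


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE proteins (pid TEXT, pname TEXT)")
db.execute("INSERT INTO proteins VALUES ('P1', 'kinase A')")

mediator = Mediator([
    FlatFileWrapper("acc\tlabel\nP1\tkinase A\nP2\tprotease B\n"),
    RelationalWrapper(db),
    XmlWrapper('<entries><entry id="P1"><name>kinase A</name></entry></entries>'),
])
hits = mediator.query("P1")
print(hits)
```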
Standard Database Access Interface 2.0
Std Data source Access Interface
Features:
• Standard access to data sources
• Plug-in architecture based on dynamic libraries
• Wrapper extensions for bioinformatics data sources
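The plug-in idea can be illustrated with a small registry in which each wrapper registers itself under the kind of data source it handles. The slide refers to dynamic libraries; this Python sketch only mimics that mechanism with a decorator-based registry, and every name in it (`register_wrapper`, `open_source`, the two source classes) is hypothetical.

```python
from typing import Callable, Dict

# Maps a data-source kind to the factory that builds its wrapper.
WRAPPER_REGISTRY: Dict[str, Callable[[str], object]] = {}


def register_wrapper(kind: str):
    """Class decorator that registers a wrapper plug-in for one source kind."""
    def decorator(factory):
        WRAPPER_REGISTRY[kind] = factory
        return factory
    return decorator


@register_wrapper("flatfile")
class FlatFileSource:
    def __init__(self, location: str) -> None:
        self.location = location

    def describe(self) -> str:
        return f"flat file at {self.location}"


@register_wrapper("relational")
class RelationalSource:
    def __init__(self, location: str) -> None:
        self.location = location

    def describe(self) -> str:
        return f"relational database at {self.location}"


def open_source(kind: str, location: str):
    """Single entry point: pick the plug-in by kind and build the wrapper."""
    try:
        factory = WRAPPER_REGISTRY[kind]
    except KeyError:
        raise ValueError(f"no wrapper plug-in registered for {kind!r}") from None
    return factory(location)


source = open_source("flatfile", "/data/example.dat")
print(source.describe())
```

Supporting a new kind of bioinformatics data source then means adding one more registered class, without touching `open_source` or its callers.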
Level 4: WorkFlow Management System (WFMS)
The WorkFlow Management Coalition (WFMC) defines workflow as: “The automation of a business process, in whole or part, during which documents, information or tasks are passed from one participant to another for action, according to a set of procedural rules.”
WFMC
• founded in 1993, 24 countries, 170 members
• terminology, standard interfaces, promotion
Workflow phases
• Stage 1 - Component discovery: discovers the available bioinformatics tools, data banks and graphics tools, modeled through the ontology;
• Stage 2 - Workflow editing: the discovered components are made available to a semantic editor that allows the design of an experiment (Abstract Workflow); the activities are modeled using UML;
• Stage 3 - Execution plan: the abstract workflow is translated into an “execution plan” (Concrete Workflow) containing the order of the activities and the logical names of the resources (needed for their discovery in a Grid environment);
• Stage 4 - Application execution: the ProGenGrid scheduler schedules the concrete workflow on a computational grid;
• Stage 5 - Application monitoring: whenever workflow activities start or finish, the system visualizes the progress of the workflow execution using a graphical utility.
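Stage 3 can be illustrated by turning a dependency graph of activities into an ordered plan. The sketch below uses a topological sort from the Python standard library (`graphlib`, Python 3.9+); this is one possible technique, not necessarily the one used by ProGenGrid, and the activity and resource names are invented for the example.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set, Tuple

# Abstract workflow: each activity maps to the activities it depends on.
abstract_workflow: Dict[str, Set[str]] = {
    "fetch_sequence": set(),
    "similarity_search": {"fetch_sequence"},
    "fetch_structure": {"fetch_sequence"},
    "docking": {"similarity_search", "fetch_structure"},
    "visualize": {"docking"},
}

# Hypothetical logical resource names, to be resolved later by discovery.
logical_resources: Dict[str, str] = {
    "fetch_sequence": "sequence-databank",
    "similarity_search": "alignment-tool",
    "fetch_structure": "structure-databank",
    "docking": "docking-engine",
    "visualize": "graphics-tool",
}


def build_execution_plan(
    workflow: Dict[str, Set[str]], resources: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Order activities so each one runs after all of its dependencies."""
    order = TopologicalSorter(workflow).static_order()
    return [(activity, resources[activity]) for activity in order]


plan = build_execution_plan(abstract_workflow, logical_resources)
for step, (activity, resource) in enumerate(plan, start=1):
    print(step, activity, resource)
```

`TopologicalSorter` raises `graphlib.CycleError` if the activities depend on each other in a cycle, which gives a simple validity check on the abstract workflow before anything is scheduled.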
Workflow architecture
[Figure: workflow architecture. Labels along the flow: the ProGenGrid Editor discovers the available components (MOR); validation (MOR); the result is translated into the Abstract Workflow; the Execution Plan (Concrete Workflow) is sent to the Enactment Service, which generates the WorkList of Activities, queries the Resource Discovery & Selector (which selects through the Resource Information Service), and transforms activities into Web Service Invocations that execute on Grid Resources.]
Workflow GUI
• Toolbar for inserting special states, fork, join and condition tasks.
• UML graph related to the current workflow.
• Classification of the activities available on a computational grid.
Graphical WorkFlow Monitoring
• Activity and workflow status, with the related application error messages, represented as activities.
UML class diagram of the workflow editor
[Figure: UML class diagram. Class, attribute and method names are kept as in the original (Italian) model.]
• workflowView: abstract class that represents both the leaves and the composite element; it provides the interface and the default behavior of all classes. Operations: disegna(Graphics), contiene(Point2D), aggiungi(workflowView), elimina(workflowView), getBaricentro(), setBaricentro(Point2D), getIDUML(), getWidth(), getHeigth(), accettaCollegamenti(), riceveCollegamenti(), setText(String), getText(), setPrecedente(ID : String), setSuccessivo(ID : String), cancellaSuccessivo(ID : String), cancellaPrecedente(ID : String), getPuntoCollegamento(Line2D).
• workflowUML: concrete class used to contain and manage the graphical elements of the workflow; linked to java.util.Observer. Attributes: elementiGrafici[0..*] : workflowView, prossimoID. Operations: disegna(Graphics), contiene(Point2D), aggiungi(workflowView), elimina(workflowView), generaID(), getElementiGrafici(), collega(origine : workflowView, destinazione : workflowView), collegaCond(origine : conditionUML, destinazione : workflowView, espressione : String), getAttivitaUML(IDModello : String), getOggettoUML(IDUML : String), addStartUML(baricentro : Point2D), addEndUML(baricentro : Point2D), addForkUML(baricentro : Point2D), addJoinUML(baricentro : Point2D), addConditionUML(baricentro : Point2D), disconnetti(origine : workflowView, destinazione : workflowView), getFrecciaUML(IDUMLOrigine : String, IDUMLArrivo : String), update(Observable o, Object arg).
• attivitaUML: larghezza, altezza, larghezzaArco, altezzaArco, rettangolo : RoundRectangle2D, baricentro : Point2D, IDUML, IDModelloAttivita, descrizione, IDOggettoUMLPrec, IDOggettoUMLSucc, tipoInputUtente, valoreInputUtente. Additional operations: getIDModello(), getValoreInput(), setValoreInput(), getTipoInputUtente(), setTipoInputUtente(), getIDOggettoUMLSucc(), getIDOggettoUMLPrec().
• startUML: diametro, baricentro : Point2D, cerchio : Ellipse2D, IDUML, IDOggettoUMLSucc, commento. It has a successor only: setSuccessivo(ID : String), cancellaSuccessivo(ID : String).
• endUML: diametroInt, diametroEst, baricentro : Point2D, cerchioInt : Ellipse2D, cerchioEst : Ellipse2D, IDUML, IDOggettoUMLPrec, commento. It has a predecessor only: setPrecedente(ID : String), cancellaPrecedente(ID : String), getIDOggettoUMLPrec().
• forkUML: larghezza, altezza, baricentro : Point2D, rettangolo : Rectangle2D, IDUML, IDOggettoUMLPrec : String, IDOggettiUMLSucc[0..*] : String, commento. Additional operations: getOggettiUMLSucc(), getIDOggettoUMLPrec().
• joinUML: larghezza, altezza, baricentro : Point2D, rettangolo : Rectangle2D, IDUML, IDOggettiUMLPrec[0..*] : String, IDOggettoUMLSucc : String, commento. Additional operations: getOggettoUMLSucc(), getIDOggettiUMLPrec().
• conditionUML: larghezza, altezza, baricentro : Point2D, rombo : Polygon, IDUML, IDOggettoUMLPrec : String, IDOggettiUMLSucc[0..*] : String, commento. Additional operations: getIDOggettoUMLPrec(), getIDOggettiUMLSucc().
• frecciaUML: traiettoria : Line2D, effettiva : Line2D, WF : workflowUML, IDUML, IDOggettoUMLOrigine, IDOggettoUMLArrivo, commento. Operations: disegna(Graphics), contiene(Point2D), getIDUML(), setText(String), getText(), getIDOggettoUMLOrigine(), getIDOggettoUMLArrivo().
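The class diagram follows the composite pattern: one abstract type for both single graphical elements and the container that holds them. The sketch below shows that pattern in Python with English, simplified names (`WorkflowView`, `WorkflowDiagram`, `ActivityNode` and so on are stand-ins chosen for this example, not the original Java classes), and with drawing reduced to returning text lines.

```python
from abc import ABC, abstractmethod
from typing import List


class WorkflowView(ABC):
    """Common interface for leaves and for the composite element."""

    @abstractmethod
    def draw(self) -> List[str]:
        """Return a textual rendering (stands in for drawing on a canvas)."""

    def add(self, child: "WorkflowView") -> None:
        # Default behavior: leaves cannot hold children.
        raise NotImplementedError("leaf elements cannot contain children")


class StartNode(WorkflowView):
    def draw(self) -> List[str]:
        return ["start"]


class EndNode(WorkflowView):
    def draw(self) -> List[str]:
        return ["end"]


class ActivityNode(WorkflowView):
    def __init__(self, description: str) -> None:
        self.description = description

    def draw(self) -> List[str]:
        return [f"activity: {self.description}"]


class WorkflowDiagram(WorkflowView):
    """Composite: contains and manages the graphical elements."""

    def __init__(self) -> None:
        self._elements: List[WorkflowView] = []

    def add(self, child: WorkflowView) -> None:
        self._elements.append(child)

    def draw(self) -> List[str]:
        # Drawing the diagram means drawing every contained element in order.
        lines: List[str] = []
        for element in self._elements:
            lines.extend(element.draw())
        return lines


diagram = WorkflowDiagram()
diagram.add(StartNode())
diagram.add(ActivityNode("align sequences"))
diagram.add(EndNode())
print(diagram.draw())
```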
Drug Discovery: Development Life Cycle
• Discovery (2 to 10 Years)
• Preclinical Testing (Lab and Animal Testing)
• Phase I (20-30 Healthy Volunteers used to check for safety and dosage)
• Phase II (100-300 Patient Volunteers used to check for efficacy and side effects)
• Phase III (1000-5000 Patient Volunteers used to monitor reactions to long-term drug use)
• FDA Review & Approval
• Post-Marketing Testing
Overall: 7 – 15 Years! $600-700 Million!
[Figure: timeline of the stages on an axis from 0 to 16 years.]
Phases of DD
• Target Identification
− What protein can we attack to stop the disease from progressing?
• Lead discovery & optimization
− What sort of molecule will bind to this protein? (Molecular Docking)
• Toxicology
− Side effects
Issues and Grid solutions for DD
• Screening of a large set of compounds
− The old way: exhaustive screening
− The new way: parallel screening on the Grid!
• Docking
− The old way: execution of legacy software
− The new way: large-scale docking, transforming existing software into parameter sweep applications for execution on distributed systems
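A parameter sweep turns one program into many independent jobs, one per combination of input values, which is what makes it easy to distribute. The sketch below only generates the job descriptions; the parameter names (`ligand`, `grid_spacing`, `seed`) and file names are hypothetical and are not options of any real docking package.

```python
from itertools import product
from typing import Dict, List, Sequence, Union

Job = Dict[str, Union[int, float, str]]


def parameter_sweep(
    ligands: Sequence[str],
    grid_spacings: Sequence[float],
    seeds: Sequence[int],
) -> List[Job]:
    """Build one independent job per combination of the swept parameters."""
    jobs: List[Job] = []
    combinations = product(ligands, grid_spacings, seeds)
    for job_id, (ligand, spacing, seed) in enumerate(combinations):
        jobs.append(
            {
                "job_id": job_id,
                "ligand": ligand,
                "grid_spacing": spacing,
                "seed": seed,
            }
        )
    return jobs


jobs = parameter_sweep(
    ligands=["ligand_a.mol", "ligand_b.mol"],
    grid_spacings=[0.375, 0.5],
    seeds=[1, 2, 3],
)
print(len(jobs))
print(jobs[0])
```

Because no job depends on another, a scheduler is free to place each one on any available node.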
Split Service: General Purpose Schema
[Figure: Split Service schema. Labels: splitter request in XML format (ACL, ID, Data & Query); Split Service with its Splitter Component; upload/download of results; request of the available IDs; fragment IDs; clients ClientA, ClientB and ClientAB, some of them “behind the wall”; Computational Engine; numbered steps 1 to 3.]
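The core idea of splitting a request into identified fragments and reassembling the partial results could look like the sketch below. It is a simplified illustration: the fragment ID format, the function names and the in-memory "processing" are assumptions, and the ACL handling and client interactions shown in the schema are not modeled.

```python
from typing import Dict, List


def split_request(
    request_id: str, items: List[str], fragment_size: int
) -> List[Dict[str, object]]:
    """Cut the input into fragments, each tagged with its own ID."""
    if fragment_size < 1:
        raise ValueError("fragment_size must be at least 1")
    fragments = []
    for start in range(0, len(items), fragment_size):
        fragments.append(
            {
                "fragment_id": f"{request_id}-{start // fragment_size}",
                "items": items[start:start + fragment_size],
            }
        )
    return fragments


def merge_results(results: Dict[str, List[str]]) -> List[str]:
    """Reassemble partial results in fragment order, whatever order they arrived in."""
    ordered_ids = sorted(results, key=lambda fid: int(fid.rsplit("-", 1)[1]))
    merged: List[str] = []
    for fragment_id in ordered_ids:
        merged.extend(results[fragment_id])
    return merged


compounds = [f"cmpd{i}" for i in range(7)]
fragments = split_request("req1", compounds, fragment_size=3)

# Simulate computational engines returning fragments out of order.
partial: Dict[str, List[str]] = {}
for fragment in reversed(fragments):
    partial[fragment["fragment_id"]] = [c.upper() for c in fragment["items"]]

merged = merge_results(partial)
print([f["fragment_id"] for f in fragments])
print(merged)
```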
Enhanced Split Service
Within the ProGenGrid project we have been developing an enhanced Split Service customized for bioinformatics applications. Customizations are related to:
• Computational Engines
− Autodock
− Dock (Sphgen, grid)
− GAMESS
• Broker functionalities
• Workflow support
• High level functionalities for end users
Conclusions and Future Work
ProGenGrid is a software platform allowing the composition of existing bioinformatics resources, wrapped as Web Services, to create complex workflows. In particular, it offers:
• tools for service composition, workflow execution and monitoring;
• a data integration approach to simplify access to heterogeneous biological databases.
In the future…
Full implementation of the architecture and its evaluation against other approaches.
SPACI Project
Southern Partnership for Advanced Computational Infrastructures
A grid infrastructure based on three geographically distributed High Performance Computing Centers located in Southern Italy:
• ISUFI/CACT, Center for Advanced Computing Technologies, University of Lecce. Director: Prof. Giovanni Aloisio
• CPS/CNR, Center for Research on Parallel Computing and Supercomputing (now Section of Naples of ICAR/CNR). Director: Prof. Almerico Murli
• MIUR/HPCC, Center of Excellence for High Performance Computing, University of Calabria. Director: Prof. Lucio Grandinetti
For any information…
About ProGenGrid Project
Project P. I.: Maria Mirto ([email protected]), Giovanni Aloisio ([email protected]), Massimo Cafaro ([email protected]), Sandro Fiore ([email protected])
WebSite: http://datadog.unile.it/progen