A grid environment for high-throughput proteomics

Mario Cannataro, Annalisa Barla, Roberto Flor, Giuseppe Jurman, Stefano Merler, Silvano Paoli, Giuseppe Tradigo, Pierangelo Veltri, Cesare Furlanello

Abstract— We connect in a grid-enabled pipeline an ontology-based environment for proteomics spectra management with a machine learning platform for unbiased predictive analysis. We exploit two existing software platforms (MS-Analyzer and BioDCV), the emerging proteomics standards, and the middleware and computing resources of the EGEE Biomed VO grid infrastructure. In the setup, BioDCV is accessed by the MS-Analyzer workflow as a web service, thus providing a complete grid environment for proteomics data analysis. Predictive classification studies on MALDI-TOF data based on this environment are presented.

I. INTRODUCTION

Data intensive biomedical applications constitute one of the most challenging application domains for grid technologies. Setting up a grid application in these domains involves dealing with sequences of processing steps that are all crucial to the resulting analyses. From the management of high-throughput raw data to the development of models, there is frequent need to concatenate heterogeneous computing tasks and to integrate information from other sources such as clinical data or lab measures. Medical imaging workflows have already been proposed and demonstrated on the Enabling Grid for E-SciencE (EGEE) production grid [1], [2]. In this paper, we present a grid environment that implements this idea for high-throughput proteomics analyses. In particular, we want to grid-enable the various phases related to the modeling and processing of mass spectrometry data. In fact, grid enabling has become one of the major bioinformatic tasks for academic and industrial research. The complexity of this task is not just the result of the very large size of spectra databases, but also of two crucial methodological issues, both regarding the reproducibility of results [3]. First, many alternative procedures are available for noise subtraction, data normalization and candidate peak identification for the different mass spectrometry devices. This upstream analysis affects the detection of informative patterns in the modeling phase, but the rational use and tuning of the preprocessing methods is still awkward and in need of standards [4]. Second, as happened with microarray studies, methodological flaws have been shown to affect proteomics studies [5]. One recommended approach is to use replicated experiments [6]; its main drawback is the unavoidable high computational burden. To face this, 'gridification' has shown itself to be an effective solution [7].

We propose to respond to these issues by integrating the two main systems for proteomics data profiling [7], [8] into a single grid-enabled system. We exploit two existing software platforms with grid capabilities (MS-Analyzer and BioDCV) and some emerging proteomics standards (the mzData data structures proposed by the HUPO-PSI initiative [9]). A service-oriented architecture is used to connect the spectra manager together with preprocessing and predictive modeling resources. In connecting the two platforms, we follow recently introduced paradigms for efficient service composition on scientific grid infrastructures [2]. MS-Analyzer applications are structured as workflows and communicate with BioDCV through web service invocation. While several tools are available for handling each step of proteomics analysis, to our knowledge no comprehensive platform has been developed so far that completes the pipeline inside an automated workflow for proteomics.

The paper is organized as follows: Section II (Methods), which represents the core of the paper, includes a detailed description of all the components of the grid environment; Section III describes the two datasets used to validate the approach, while the results of experiments on the EGEE Biomed Virtual Organization (VO) are summarized in Section IV. Final comments are reported in Section V.

M. Cannataro, G. Tradigo and P. Veltri are with University Magna Græcia, Catanzaro, Italy. E-mail: {cannataro, veltri}@unicz.it, [email protected]. S. Paoli, G. Jurman, S. Merler, R. Flor and C. Furlanello are with FBK-irst, Trento, Italy. A. Barla is with DISI, University of Genoa. E-mail: {paoli, jurman, merler, flor, furlanello}@itc.it, [email protected]. Manuscript received 20/11/2006; revised.

II. METHODS

The proposed environment is structured in two systems connected by a web service: an upstream one (MS-Analyzer), responsible for managing and preprocessing the raw data produced by the spectrometer, and a downstream one (BioDCV), responsible for performing classification and feature ranking inside a complete validation methodology. Web services, workflows, and grid middleware are used to build the infrastructure. Web services are used to implement the spectra preprocessing services (MS-Analyzer) and the unbiased classification service (BioDCV). Workflows are used to specify the proteomics pipeline, including loading and preprocessing spectra as well as their preparation for BioDCV classification. Ontologies are used in MS-Analyzer to guide the workflow composition as well as to highlight constraints on using the available tools. Grid middleware is used for executing BioDCV classification processes on remote facilities and for secure and efficient data transfer between MS-Analyzer and BioDCV.

A. MS-Analyzer

MS-Analyzer [8] is a software environment for the integrated management and processing of mass spectrometry proteomics data developed at University Magna Græcia of Catanzaro. Please note the name similarity between our MS-Analyzer system and the MSAnalyzer software developed at the Institute for Systems Biology [10]. MS-Analyzer is a Grid-based Problem Solving Environment that uses domain ontologies for modeling software tools and spectra data, and workflow techniques for designing data analysis applications (in silico experiments). MS-Analyzer ontologies [11] model bioinformatics knowledge about: (i) bioinformatics software tools (e.g. preprocessing tools); (ii) bioinformatics processes (e.g. a workflow of a classification experiment); and (iii) experimental data sets (e.g. a set of spectra related to healthy and diseased subjects). Workflows are used to compose the different tasks needed when analyzing spectra data. A peculiar characteristic of MS-Analyzer is the use of ontologies to guide and validate workflow building in an interactive way. This function is offered by the Ontology-based Workflow Designer and Scheduler described in [12] and sketched in the following. An important component of MS-Analyzer, also developed at University Magna Græcia of Catanzaro, is SpecDB, a specialized spectra database used to manage and share experimental spectra data [13]. Finally, MS-Analyzer offers a large set of spectra-related services, such as spectra acquisition and conversion, spectra preprocessing for noise cleaning and data reduction, spectra preparation for data mining or statistical analysis, spectra mining (e.g. classification or clustering), and spectra and knowledge model visualization.

[Fig. 1. Service Oriented Architecture of MS-Analyzer.]

MS-Analyzer Architecture

MS-Analyzer (see Fig. 1) uses a Service Oriented Architecture (SOA) and provides a collection of specialized spectra management services, including spectra preprocessing, spectra analysis (e.g. data mining and visualization), and data movement services. The adoption of the SOA approach permits the integration into MS-Analyzer of additional spectra management services (e.g. novel preprocessing tools) and of sophisticated, third-party analysis tools such as the BioDCV service. A central component of the system is the Ontology-based Workflow Designer and Scheduler, which allows for easy retrieval, composition, and execution of such services. Spectra data, which undergo various transformations depending on the applied preprocessing services, are stored and managed in SpecDB. During workflow composition, ontologies are searched and browsed to guide the choice of the services and tasks most suitable for solving a problem.

The main functions offered by MS-Analyzer are the following:

• Interface to heterogeneous mass spectrometers, such as MALDI-TOF, SELDI-TOF, and ICAT-based LC-MS/MS. Formats are unified into mzData, in compliance with the HUPO-PSI proteomics standardization initiative [9].

• Acquisition, storage, and management of MS data using the SpecDB database. Spectra are stored in their different stages (raw, pre-processed, prepared), using different representations (original, relational, or mzData).

• Preprocessing (smoothing, baseline subtraction, normalization, binning, peak alignment), preparation (for data mining), and analysis (mining and visualization) of MS data. Such functions are made available as services (see [4] for details on the algorithms).

• Sharing of experimental data, workflows and knowledge models. Spectra datasets can be retrieved by querying SpecDB, while executed workflows and discovered models are stored in XML files [13].

The SpecDB Spectra Database

The SpecDB spectra database [13] is the data layer of MS-Analyzer (see the bottom part of Fig. 1). To face the huge volumes of mass spectra data that cannot be analyzed using main memory alone, a hybrid XML-relational database for spectra data was developed. It provides easy, efficient access to a single spectrum, to multiple spectra, and to relevant portions of spectra. Spectra may be stored using three different formats:

• raw (original) files. Original data produced by the MS instrument are saved unchanged on the file system. These data are indexed and referenced in the database, allowing users to retrieve the original dataset;

• tuples in a relational database. An open-source database instance is used to store spectra as couples (intensity, m/z) adorned with meta-information about instrument, experiment, results and clinical/biological annotations;

• mzData XML-based instances [9]. Meta-information is stored as XML element instances, where spectra measurements are compressed into an XML element using a simple coding compression formalism (sketched after this list). The mzData representation is very useful for sharing whole spectra among nodes. Moreover, using XML, it is possible to define simple views on XML data, permitting fine-tuned, personalized access to such data. mzData instances are stored in a native XML database.
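To make the third storage format concrete, the following is a minimal sketch of how spectrum measurements can be packed into a compressed, base64-encoded XML element. Element and attribute names here are simplified stand-ins, not the exact mzData schema (which is defined by HUPO-PSI [9]), and the zlib step is one plausible instance of the "simple coding compression formalism" mentioned above.

```python
import base64
import struct
import zlib
import xml.etree.ElementTree as ET

def pack_spectrum(mz_values, intensities):
    """Pack (m/z, intensity) arrays into compressed, base64-encoded
    XML elements, in the spirit of mzData's binary data arrays."""
    spectrum = ET.Element("spectrum")
    for name, values in (("mzArray", mz_values), ("intensityArray", intensities)):
        raw = struct.pack("<%dd" % len(values), *values)  # little-endian doubles
        data = ET.SubElement(spectrum, name, precision="64",
                             endian="little", length=str(len(values)))
        data.text = base64.b64encode(zlib.compress(raw)).decode("ascii")
    return spectrum

# Example: a three-point toy spectrum.
elem = pack_spectrum([9100.2, 9133.1, 9180.5], [120.0, 4000.0, 310.0])
print(ET.tostring(elem).decode("ascii"))
```

Because the peak arrays are opaque compressed payloads, metadata queries can be answered without ever decoding them, which is exploited by the Query Service described next.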

SpecDB is used to store spectra datasets in their different stages (raw, preprocessed, prepared), keeping track of the different phases of proteomics experiments. Usually, a raw spectra dataset is stored in the original format on the file system for the first archiving; the preprocessed dataset can then be stored in both the relational and mzData formats. The former is used to prepare data for further data mining, while the latter is used for sharing spectra datasets with other laboratories [13]. Access to spectra data is provided through a Query Service that supports both SQL-like queries on relational data and XPath queries on the mzData schema (both query styles are sketched after the list below). The former are used in retrieving portions of spectra; the latter are useful for retrieving entire spectra datasets on the basis of mzData metadata, without inspecting peak values (stored in compressed XML elements).

In summary, SpecDB offers the following spectra data management functions, which are used by MS-Analyzer services:

• efficient storage and retrieval of data (single spectrum, sets of spectra and portions of spectra);

• import/export functions (e.g. loading of raw spectra and exporting of spectra in mzData format);

• query/update functions able to enhance the performance of data preprocessing and analysis (e.g. avoiding full main-memory processing); for instance, stored SQL procedures could be used to implement some preprocessing techniques (e.g. binning);

• retrieval through XML-based querying of spectra datasets shared on the Grid.
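The following sketch illustrates the two query styles. All table, column and element names are hypothetical, not SpecDB's actual schema: a local SQLite file stands in for the open-source relational instance, and a trivial in-memory document stands in for the native XML database.

```python
import sqlite3
import xml.etree.ElementTree as ET

# SQL-style access: retrieve a portion of one spectrum (a peak window).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE peaks (spectrum_id INTEGER, mz REAL, intensity REAL)")
conn.executemany("INSERT INTO peaks VALUES (?, ?, ?)",
                 [(1, 9100.0, 120.0), (1, 9133.2, 4000.0), (1, 9180.5, 310.0)])
window = conn.execute(
    "SELECT mz, intensity FROM peaks"
    " WHERE spectrum_id = ? AND mz BETWEEN ? AND ?", (1, 9120, 9160)).fetchall()
print(window)  # -> [(9133.2, 4000.0)]

# XPath-style access: select whole spectra by metadata only, without
# touching the compressed peak arrays. Element names are simplified.
doc = ET.fromstring(
    "<mzData><spectrum id='1' sample='case'/>"
    "<spectrum id='2' sample='control'/></mzData>")
cases = doc.findall(".//spectrum[@sample='case']")
print([s.get("id") for s in cases])  # -> ['1']
```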

Ontology-Based Workflow Designer and Scheduler

Fig. 2 shows a snapshot of the Ontology-based Workflow Designer and Scheduler: the right panel shows a fragment of the MS-Analyzer ontology that models software tools, the left panel shows the available spectra datasets, and in the central panel the user designs the workflow by drag&drop of data sources and tools. The editor produces an abstract graphical workflow schema that is translated to an internal representation (a directed graph) and to the BPEL (Business Process Execution Language for Web Services) workflow language.

[Fig. 2. The Ontology-based Workflow Designer and Scheduler.]

The BPEL workflow language was initially chosen because it is an emerging standard for web services workflows and because BPEL schedulers are already available (e.g. from IBM and Oracle). The first experience using MS-Analyzer showed us that not all BPEL functions are important for the typical spectra analysis workflow; a different workflow language more suitable for bioinformatics workflows, such as Scufl [14], is currently being considered.

The Ontology-based Workflow Designer and Scheduler comprises the following components (see the top part of Fig. 1):

• The Ontology-Based Assistant is a wizard that suggests the tools available for a given bioinformatics problem or describes the characteristics of experimental spectra. It also provides information about the input and output allowed for each tool. This information is used to check the consistency of the designed workflow.

• The Workflow Editor, using the functions of the Ontology-Based Assistant, allows users to specify and design applications as workflows. Currently, workflows are designed using a graphic notation based on UML (Unified Modelling Language).

• The Workflow Scheduler schedules and controls the execution of activities. To optimize application execution or to satisfy constraints, the Workflow Scheduler may move data by calling Grid functions, such as Globus GridFTP [15]. While the Ontology-based Workflow Editor has been fully implemented, scheduling activities are currently managed by a rough internal scheduler that analyzes the workflow schema and activates the tasks of the workflow (represented internally as a directed graph) sequentially and in a centralized manner. It invokes the web services by sending them the proper SOAP messages, then collects the results (which are stored in SpecDB) that in turn are used to compose the next SOAP message for the next service (a toy version of this loop is sketched after the list). In the experiments described in Section IV, the normalization, binning and preparation services are invoked sequentially: raw spectra are first loaded and normalized, then the resulting data are binned, and finally the binned data are transferred to BioDCV. Grid middleware functions, such as Globus GridFTP [15], can be used to transfer datasets when non-local services are involved.

• The Workflow Metadata Repository contains all the information on workflow schema and execution, including data about the control flow (WF Schema Metadata) and the data used to perform a task (WF Application Metadata).
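As an illustration of the sequential, centralized scheduling loop described above, here is a toy sketch. The task names, the workflow graph and the invoke stub are hypothetical; the real scheduler exchanges SOAP messages with the services and stores intermediate results in SpecDB.

```python
# Hypothetical task graph for the experiment in Section IV: each task maps
# to one web service; "after" edges give the data dependencies.
workflow = {
    "load":      {"after": [],            "service": "SpectraLoader"},
    "normalize": {"after": ["load"],      "service": "Normalizer"},
    "binning":   {"after": ["normalize"], "service": "Binner"},
    "prepare":   {"after": ["binning"],   "service": "BioDCV-WS"},
}

def invoke(service, payload):
    # Placeholder for the SOAP call; the real scheduler builds a SOAP
    # message, sends it, and stores the reply in SpecDB.
    print("invoking %s with input %r" % (service, payload))
    return "%s-output" % service

def run_sequential(workflow):
    done, result = set(), None
    while len(done) < len(workflow):
        for name, task in workflow.items():
            if name not in done and all(d in done for d in task["after"]):
                result = invoke(task["service"], result)  # output feeds next call
                done.add(name)
    return result

run_sequential(workflow)
```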

B. BioDCV

The predictive modeling portion of the proposed system is provided by BioDCV, the FBK-irst platform for machine learning in high-throughput functional genomics. BioDCV fully supports complete validation in order to control selection bias effects. To harness the intensive data throughput, BioDCV uses E-RFE, an entropy-based acceleration of the SVM-RFE feature-ranking procedure [16]. For proteomics, it includes preprocessing methods adapted from existing R packages and concatenated to the complete validation system.
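The key point of complete validation is that feature ranking must be repeated inside every replicated split, never performed once on the whole dataset. The sketch below illustrates the scheme using scikit-learn's linear SVM and standard RFE as stand-ins on random toy data; BioDCV's actual implementation uses the E-RFE acceleration [16] and is not based on this library.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFE
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.RandomState(0)
# Toy stand-in shaped like the Ovarian Cancer data: 49 samples x 67 peaks.
X, y = rng.randn(49, 67), rng.randint(0, 2, 49)

errors = []
splitter = StratifiedShuffleSplit(n_splits=100, test_size=0.3, random_state=0)
for train, test in splitter.split(X, y):
    # Rank features and fit the model on the training split ONLY: running
    # RFE on the full dataset first is exactly the selection bias that
    # complete validation is designed to avoid.
    selector = RFE(LinearSVC(dual=False),
                   n_features_to_select=10).fit(X[train], y[train])
    errors.append(np.mean(selector.predict(X[test]) != y[test]))

print("average test error: %.3f" % np.mean(errors))  # cf. the ATE curve, Fig. 4
```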

BioDCV has been a grid application since 2005. Initially designed for local clusters, BioDCV was grid-enabled so that it could run on the EGRID VO within the Italian INFN infrastructure. It has accounted for 193 CPU-days in benchmark analyses. Investigations found a linear speed-up and a footprint defined as the product of the number of samples and the number of features. Since March 2006, BioDCV has been running as an external application in the EGEE Biomed VO, the virtual organization for the biomedical domain of the EGEE project. The EGEE Grid consists of over 20,000 CPUs available to users 24 hours a day, 7 days a week, in addition to about 5 Petabytes of storage. On average, 20,000 concurrent jobs are handled per day.

As a grid application, BioDCV can be executed on LCG2/gLite middleware, as well as on MPI grid sites. Input and output are managed by SQLite databases stored as flat files on grid Storage Elements. A key aspect is that the solution does not require any external library to be additionally installed on the grid elements. BioDCV was used in production for proteomics tasks within the EGEE Biomed VO with a failure rate close to 2% [7] on first submission, while all resubmitted experiments completed successfully. The failure rate is defined as the percentage of failing subprograms among those into which the original complete BioDCV experiment is split.

C. Combining by web services

Internet web services are used to remotely integrate the main components of the proposed environment. A web service is a software system designed to support interoperable machine-to-machine interaction over a network. This definition encompasses many different specifications; the standard one is based on SOAP (Simple Object Access Protocol), using messages formatted in XML and sent over the HTTP protocol.

The BioDCV component is invoked by the MS-Analyzer workflow through a web service called BioDCV-WS. This service consists of a Python script that loads the Zolera SOAP Infrastructure (ZSI) module and runs on an Apache server. The resulting architecture, displayed in Fig. 3, is described below.

[Fig. 3. The architecture of the proposed environment: the MS-Analyzer Ontology-based Workflow Designer and proteomics data preparation, GridFTP data/metadata repositories, the BioDCV-WS front-end on a DMZ server (Apache, mod_python, ZSI module), the EGEE UI for the Biomed VO, the local cluster front-end, and the BioDCV outputs (visualization of ATE, sample tracking, HTML publication, e-mail notification). See text for details.]

The BioDCV-WS service can be called after MS-Analyzer has completed the raw spectra preprocessing and the results have been stored on a GridFTP server. The service invocation must specify three arguments: (i) the path on the GridFTP server where MS-Analyzer has stored its results; (ii) an e-mail address for communicating the status of the BioDCV task execution and the publication location of the BioDCV outputs; and (iii) whether the local cluster facility or the EGEE Biomed VO facility is to be used for computing. At this point, BioDCV-WS retrieves the data and starts the BioDCV experiment. The preprocessing phase is first run by means of an R script (developed by the authors and described in [7]); preprocessed spectra and the experimental design are then stored in an SQLite database. After this step, BioDCV-WS uploads the data and starts one or more jobs on the selected computing resource.
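As an illustration, a client could invoke BioDCV-WS with the three arguments roughly as follows. The endpoint URL, operation and parameter names are hypothetical placeholders, not the service's actual WSDL contract, and the deployed clients use ZSI (or Java) bindings rather than a hand-built envelope.

```python
import urllib.request

# Hypothetical endpoint; the real one is published through the service WSDL.
ENDPOINT = "https://biodcv.example.org/BioDCV-WS"

envelope = """<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <runExperiment xmlns="urn:biodcv">
      <gridftpPath>gsiftp://gridftp.example.org/data/run01/</gridftpPath>
      <notifyEmail>user@example.org</notifyEmail>
      <facility>egee-biomed</facility><!-- or: local-cluster -->
    </runExperiment>
  </soap:Body>
</soap:Envelope>"""

request = urllib.request.Request(
    ENDPOINT, data=envelope.encode("utf-8"),
    headers={"Content-Type": "text/xml; charset=utf-8",
             "SOAPAction": "urn:biodcv#runExperiment"})
with urllib.request.urlopen(request) as reply:  # returns once the job is accepted
    print(reply.read().decode("utf-8"))
```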

First, here is what happens when a grid facility is involved. The BioDCV-WS script interacts with a Linux daemon service (BioDCV-WS-UI, also a Python program) running on an LCG2/gLite user interface to submit and control jobs on the EGEE grid infrastructure. In particular, this implementation is designed for the current Biomed VO of EGEE. When BioDCV-WS launches a complete validation profiling on the grid, the daemon service BioDCV-WS-UI on the EGEE user interface first copies the SQLite database to a Storage Element (SE, the data server machine) and then submits the BioDCV complete validation procedure as parallel jobs (through a Job Description Language (JDL) file for each job) on the Biomed VO grid sites.

The BioDCV-WS-UI script checks the status of the parallel jobs and, when they are completed, collects their outputs and unifies them, adding the result to the original SQLite database. File transfers within the grid are performed by LFC (LCG File Catalog) commands, always through the user interface commands invoked by BioDCV-WS-UI.
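To give an idea of this step, the sketch below generates one JDL file per parallel job and submits each from the user interface. The executable name, arguments and sandbox contents are assumptions, not the actual BioDCV-WS-UI code, and the submission command varied across middleware releases (LCG2 shipped edg-job-submit; later gLite releases renamed it).

```python
import subprocess
from pathlib import Path

# Minimal JDL skeleton; attribute values are illustrative placeholders.
JDL_TEMPLATE = """\
Executable      = "biodcv_run.sh";
Arguments       = "{split:d} {total:d}";
StdOutput       = "biodcv_{split:d}.out";
StdError        = "biodcv_{split:d}.err";
InputSandbox    = {{"biodcv_run.sh"}};
OutputSandbox   = {{"biodcv_{split:d}.out", "biodcv_{split:d}.err"}};
VirtualOrganisation = "biomed";
"""

def submit(splits, workdir="."):
    for split in range(splits):
        jdl = Path(workdir) / ("biodcv_%d.jdl" % split)
        jdl.write_text(JDL_TEMPLATE.format(split=split, total=splits))
        # Submit through the resource broker of the Biomed VO.
        subprocess.run(["edg-job-submit", "--vo", "biomed", str(jdl)], check=True)

submit(splits=2)
```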

A different Python script (BioDCV-WS-Local-Cluster) is used by the web service BioDCV-WS to execute the R script and to launch the complete validation phase on the local cluster facility. In both cases, the BioDCV results are eventually published as HTML pages on the Apache server where the web service is running, and the MS-Analyzer user is notified by e-mail when results are complete.

Regarding network security, we remark that only BioDCV-WS runs in the DMZ area, while the EGEE (BioDCV-WS-UI) and cluster (BioDCV-WS-Local-Cluster) interfaces run behind the firewall without a direct connection to external users. Running BioDCV-WS as a Python script on an Apache server using mod_python allows our web service to use security procedures such as passwords, SSL, X509 certificate authentication and LDAP authorization (all supported by Apache). In our case, the GridFTP data server checks the X509 user certificate specifically in the EGEE user interface in order to guarantee a secure system for data access.

Python was chosen as the scripting language for code deployment because it is a dynamic object-oriented programming language and because it offers strong support for integration with other languages and tools, including the ZSI library on which our solution is based. The BioDCV-WS web service was tested with alternative client programs in Python and in Java without encountering any difficulty. In fact, the use of the ZSI module guarantees compliance with WSDL (Web Services Description Language) and thus support for Python, Java and any other WSDL-compliant solutions.

In this solution, the reduction of execution time is not a major reason for using the Grid. Rather, this architecture allows running a complex clinical proteomics experiment without the availability of a local cluster. Moreover, the design can support multicenter studies with different proteomics laboratories sharing spectra and clinical information. Such huge datasets require the high-performance data transfer functions of the Grid, such as Globus GridFTP or the Replica Manager.

III. DATA DESCRIPTION

To test the described environment, a MALDI-TOF Ovarian Cancer dataset (available with the Rdist package [17] from http://www-stat.stanford.edu) was processed for binary classification and profiling. Although the dataset is of relatively low clinical interest, it is considered a valid benchmark for algorithm testing. The dataset includes 49 samples (24 cancer and 25 controls), described by vectors of 56384 mass-to-charge measurements, for a total file size of 892 Kb. The raw data were preprocessed according to the procedure described in [7]. After baseline subtraction, AUC-normalization, centroid identification by clustering, and peak alignment, 67 centroids were identified, obtaining the representation used in our experiments on both the grid and the local facilities.

A second dataset, UniCZ1, was also used. It consists of proprietary data produced by the MALDI-TOF facility at University Magna Græcia of Catanzaro for lab calibration purposes. The biomaterial consisted of 20 human serum samples, 10 with two additional proteins and 10 used as controls. The spectra were obtained after 20 technical replicates, with 34671 measurements of mass-to-charge ratios.

IV. RESULTS

A suite of classification and profiling experiments was performed on the Ovarian Cancer and UniCZ1 datasets with the use of all the components in the pipeline. After the preprocessing on MS-Analyzer and the transfer of the data, a set of 100 BioDCV runs was submitted to the grid infrastructure; the procedure was repeated 10 times, for a total of 20 JDL files submitted through the Biomed VO resource broker to 7 grid sites. All jobs came back successfully. The system is, in this case, providing a ranked list of potential biomarkers: a selected peak might define an isotopic family, or just be a target for further investigation. As an example, the 20 best ranked peaks resulting from the first experiment are described in Tab. I, while the estimated predictive error curve for the experiment is displayed in Fig. 4.

TABLE I
Proteomic profiling on Ovarian Cancer data. The 20 best peaks according to the complete validation procedure are listed in descending order of importance (Exts: number of extractions of a marker within the top 20 best markers in 100 runs). Mean and standard deviation (SD) of rank are also listed.

#    Peak id   Exts   Mean   SD
1    29        100    5.3    2
2    49        100    6.3    2.6
3    34        99     2.4    2.4
4    46        99     6.9    4.3
5    42        98     5      4.5
6    55        97     9.8    4.5
7    32        93     11.9   3.1
8    51        91     6.2    4.9
9    41        89     9.8    5.4
10   40        87     8.6    6.5
11   33        87     9.9    3.7
12   35        84     13.4   3.7
13   59        83     14.7   3.1
14   50        79     13.3   3.3
15   43        74     6.6    4.4
16   37        65     12.4   4.6
17   28        64     10.9   5.3
18   30        59     14.4   3.7
19   53        54     14.2   3.3
20   52        53     13.2   4

The average run time was 677 seconds; each job was run on two worker nodes (for each submission), and the resulting SQLite databases produced as output coincided across the 10 experiments. The same task was also scheduled on the local cluster, mimicking the behavior of the grid worker nodes (each job running on only 2 CPUs), resulting in an average run time of 311 seconds. For comparison purposes, the same task was performed on a bi-processor workstation with two Intel Pentium IV 3.2 GHz CPUs and 2 Gb RAM, with a total run time of 466 seconds. Again, profiling results were very close to the grid experiments. The better performance achieved by the local facilities (cluster and workstation) can be explained by taking into account two main facts: the small dimension of the involved task and the higher CPU and RAM specifications of the local resources with respect to the average worker node on which the jobs run.

The use of the grid is crucial in the case of datasets with a larger footprint. With small footprints, the pure machine learning part may be comparable to the costs of the grid procedures. On large datasets, the effect of distributing the computation is much more relevant, while the housekeeping costs increase much less.
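As a rough illustration, the footprints (samples x features) of the two datasets of Section III before and after preprocessing can be compared directly; the figures below are illustrative arithmetic only.

```python
# Footprint = number of samples x number of features (Section II-B).
# Dataset sizes are taken from Section III.
datasets = {
    "Ovarian Cancer, raw":          (49, 56384),
    "Ovarian Cancer, 67 centroids": (49, 67),
    "UniCZ1, raw":                  (20, 34671),
    "UniCZ1, 347 peaks":            (20, 347),
}
for name, (samples, features) in datasets.items():
    print("%-30s footprint = %9d" % (name, samples * features))
```

With the preprocessed footprints two to three orders of magnitude smaller than the raw ones, the per-job compute cost is small enough for grid housekeeping to dominate, consistent with the timings reported above.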

[Fig. 4. Example of profiling results (Ovarian Cancer dataset): the Average Test Error rate (ATE, y-axis) is computed on the test portion of the dataset over 100 runs for SVM models with an increasing number of features n (x-axis).]

A complete process from bench to profiling has also been tested on the proprietary dataset UniCZ1. The number of potential markers was reduced to 347 peaks by the preprocessing phase. In this case, the analysis was used to assess the calibration procedure. The first discriminant peak is displayed in Fig. 5. Note that for quantitative input data (e.g. ICAT-based MS/MS data or expression levels of proteins in different samples), the system can directly provide discriminant proteins.

[Fig. 5. Class A and B average spectra at the most discriminative peak, 9133.17 Da (UniCZ1 dataset). Dotted curves indicate 95% Student bootstrap confidence intervals.]

V. CONCLUSIONS

In the set-up described in this paper, BioDCV is accessed from the MS-Analyzer workflow as a service, thus providing a complete pipeline for proteomics data analysis. This grid environment takes care of avoiding the selection bias effect in proteomics studies. At the same time, the use of workflows and ontologies for spectra management provides a user-friendly interface for application modeling and design. The use of the service-oriented architecture makes the system open and easily extensible through new proteomics tools provided as services by third parties. At the moment, a salient property of the grid application is that the low job-failure rate of BioDCV is also confirmed within the new environment.

ACKNOWLEDGMENTS

We thank D. Albanese for his support in the development of the BioDCV system and B. Irler for help in proteomics data preprocessing. We acknowledge the initial support of the EGRID project, and S. Cozzini, R. Barbera and M. Mazzuccato for the gridification of BioDCV. We also thank A. Gallo for the integration of MS-Analyzer with BioDCV, G. Cuda and M. Gaspari for providing the proteomics facilities, and P.H. Guzzi and T. Mazza for their contribution in developing the first version of MS-Analyzer.

REFERENCES

[1] C. Germain, V. Breton, P. Clarysse, Y. Gaudeau, T. Glatard, E. Jeannot, Y. Legré, C. Loomis, I. Magnin, J. Montagnat, J.-M. Moureaux, A. Osorio, X. Pennec, and R. Texier, "Grid-enabling medical image analysis," Journal of Clinical Monitoring and Computing, vol. 19, no. 4-5, pp. 339–349, 2005.

[2] T. Glatard, J. Montagnat, and X. Pennec, "Efficient services composition for grid-enabled data-intensive applications," in IEEE HPDC 2006, Paris, France, 19-23 June 2006.

[3] K. Baggerly, J. Morris, and K. Coombes, "Reproducibility of SELDI-TOF protein patterns in serum: comparing datasets from different experiments," Bioinformatics, vol. 20, no. 5, pp. 777–785, 2004.

[4] M. Cannataro, P. Guzzi, T. Mazza, G. Tradigo, and P. Veltri, "Preprocessing of Mass Spectrometry Proteomics Data on the Grid," in IEEE CBMS 2005, Dublin, Ireland, 23-24 June 2005, pp. 549–554.

[5] D. Ransohoff, "Lessons From Controversy: Ovarian Cancer Screening and Serum Proteomics," Journal of the National Cancer Institute, vol. 97, no. 4, pp. 315–319, 2005.

[6] A. Molinaro, R. Simon, and R. Pfeiffer, "Prediction error estimation: a comparison of resampling methods," Bioinformatics, vol. 21, no. 15, pp. 3301–3307, 2005.

[7] A. Barla, B. Irler, S. Merler, G. Jurman, S. Paoli, and C. Furlanello, "Proteome profiling without selection bias," in IEEE CBMS 2006, Salt Lake City, US, 22-23 June 2006, pp. 941–946.

[8] M. Cannataro and P. Veltri, "MS-Analyzer: Composing and Executing Preprocessing and Data Mining Services for Proteomics Applications on the Grid," Concurrency and Computation: Practice and Experience, Wiley, published online 19 Dec 2006, in press.

[9] S. Orchard, H. Hermjakob, P. Binz, C. Hoogland, C. Taylor, W. Zhu, R. Julian Jr., and R. Apweiler, "Further steps towards data standardisation: The Proteomic Standards Initiative HUPO 3rd annual congress, Beijing 25-27th October, 2004," Proteomics, vol. 5, no. 2, pp. 337–339, 2005.

[10] MSAnalyzer, "Seattle Proteome Center (SPC) - Proteomics Tools, NHLBI Proteomics Center at the Institute for Systems Biology," 2007. [Online]. Available: http://tools.proteomecenter.org/MSAnalyzer.php

[11] M. Cannataro, P. Guzzi, T. Mazza, G. Tradigo, and P. Veltri, "Using ontologies for preprocessing and mining spectra data on the Grid," FGCS, vol. 23, no. 1, pp. 55–60, 2007.

[12] ——, "Managing Ontologies for Grid Computing," Multiagent and Grid Systems, vol. 2, no. 1, pp. 29–44, 2006.

[13] M. Cannataro, G. Tradigo, and P. Veltri, "Sharing Mass Spectrometry Data in a Grid-based Distributed Proteomics Laboratory," Information Processing & Management, vol. 43, no. 3, pp. 577–591, 2007.

[14] R. Stevens, A. Robinson, and C. Goble, "myGrid: Personalised bioinformatics on the information grid," Bioinformatics, vol. 19, no. 1, pp. 302–302, 2004.

[15] I. Foster and C. Kesselman, "Globus Toolkit Version 4: Software for Service-Oriented Systems," in IFIP International Conference on Network and Parallel Computing (Springer-Verlag LNCS 3779), 2005, pp. 2–13.

[16] C. Furlanello, M. Serafini, S. Merler, and G. Jurman, "Semisupervised learning for molecular profiling," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 2, no. 2, pp. 110–118, 2005.

[17] R. Tibshirani, T. Hastie, B. Narasimhan, S. Soltys, G. Shi, A. Kong, and Q. Le, "Sample classification from protein mass spectrometry, by 'peak probability contrasts'," Bioinformatics, vol. 20, no. 17, pp. 3034–3044, 2004.