A grid environment for high-throughput proteomics

Mario Cannataro, Annalisa Barla, Roberto Flor, Giuseppe Jurman, Stefano Merler, Silvano Paoli, Giuseppe Tradigo, Pierangelo Veltri, Cesare Furlanello

M. Cannataro, G. Tradigo and P. Veltri are with University Magna Græcia, Catanzaro, Italy. E-mail: {cannataro, veltri}@unicz.it, [email protected]. S. Paoli, G. Jurman, S. Merler, R. Flor and C. Furlanello are with FBK-irst, Trento, Italy. A. Barla is with DISI, University of Genoa. E-mail: {paoli, jurman, merler, flor, furlanello}@itc.it, [email protected]. Manuscript received 20/11/2006; revised.

Abstract: We connect in a grid-enabled pipeline an ontology-based environment for proteomics spectra management with a machine learning platform for unbiased predictive analysis. We exploit two existing software platforms (MS-Analyzer and BioDCV), the emerging proteomics standards, and the middleware and computing resources of the EGEE Biomed VO grid infrastructure. In the setup, BioDCV is accessed by the MS-Analyzer workflow as a web service, thus providing a complete grid environment for proteomics data analysis. Predictive classification studies on MALDI-TOF data based on this environment are presented.
I. INTRODUCTION

Data intensive biomedical applications constitute one of the most challenging application domains for grid technologies. Setting up a grid application in these domains involves dealing with sequences of processing steps that are all crucial to the resulting analyses. From the management of high-throughput raw data to the development of models, there is frequent need to concatenate heterogeneous computing tasks and to integrate information from other sources such as clinical data or lab measures. Medical imaging workflows have already been proposed and demonstrated on the Enabling Grid for E-SciencE (EGEE) production grid [1], [2]. In this paper, we present a grid environment that implements this idea for high-throughput proteomics analyses. In particular, we want to grid-enable the various phases related to the modeling and processing of mass spectrometry data. Grid enabling has in fact become one of the major bioinformatics tasks for academic and industrial research. The complexity of this task is not just the result of the very large size of spectra databases, but also of two crucial methodological issues, both regarding reproducibility of results [3]. First, many alternative procedures are available for noise subtraction, data normalization and candidate peak identification for the different mass spectrometry devices. This upstream analysis affects the detection of informative patterns in the modeling phase, but the rational use and tuning of the preprocessing methods is still awkward and in need of standards [4]. Second, as happened with microarray studies, methodological flaws have been shown to affect proteomics studies [5]. One recommended approach is to use replicated experiments [6]; its main drawback is the unavoidably high computational burden. To face this, 'gridification' has shown itself to be an effective solution [7].

We propose to respond to these issues by integrating the two main systems for proteomics data profiling [7], [8] into a single grid-enabled system. We exploit two existing software platforms with grid capabilities (MS-Analyzer and BioDCV) and some emerging proteomics standards (the mzData data structures proposed by the HUPO-PSI initiative [9]). A service-oriented architecture is used to connect the spectra manager together with preprocessing and predictive modeling resources. In connecting the two platforms, we follow recently introduced paradigms for efficient service composition on scientific grid infrastructures [2]. MS-Analyzer applications are structured as workflows and communicate with BioDCV through web service invocation. While several tools are available for handling each step of proteomics analysis, to our knowledge no comprehensive platform has been developed so far to complete the pipeline inside an automated workflow for proteomics.

The paper is organized as follows: Section II (Methods), which represents the core of the paper, includes a detailed description of all the components of the grid environment; Section III describes the two datasets used to validate the approach, while results of experiments on the EGEE Biomed Virtual Organization (VO) are summarized in Section IV. Final comments are reported in Section V.

II. METHODS
The proposed environment is structured in two systems connected by a web service: an upstream one (MS-Analyzer), responsible for managing and preprocessing the raw data produced by the spectrometer, and a downstream one (BioDCV), responsible for performing classification and feature ranking inside a complete validation methodology. Web services, workflows, and grid middleware are used to build the infrastructure. Web services are used to implement the spectra preprocessing services (MS-Analyzer) and the unbiased classification service (BioDCV). Workflows are used to specify the proteomics pipeline, including loading and preprocessing spectra as well as their preparation for BioDCV classification. Ontologies are used in MS-Analyzer to guide the workflow composition as well as to highlight constraints on using the available tools. Grid middleware is used for executing BioDCV classification processes on remote facilities and for secure and efficient data transfer between MS-Analyzer and BioDCV.
A. MS-Analyzer

MS-Analyzer [8] is a software environment for the integrated management and processing of mass spectrometry proteomics data developed at University Magna Græcia of Catanzaro. Please note the name similarity between our MS-Analyzer system and the MSAnalyzer software developed at the Institute for Systems Biology [10]. MS-Analyzer is a Grid-based Problem Solving Environment that uses domain ontologies for modeling software tools and spectra data, and workflow techniques for designing data analysis applications (in silico experiments). MS-Analyzer ontologies [11] model bioinformatics knowledge about: (i) bioinformatics software tools (e.g. preprocessing tools); (ii) bioinformatics processes (e.g. a workflow of a classification experiment); and (iii) experimental data sets (e.g. a set of spectra related to healthy and diseased subjects). Workflows are used to compose the different tasks needed when analyzing spectra data. A peculiar characteristic of MS-Analyzer is the use of ontologies to guide and validate in an interactive way the workflow building. Such a function is offered by the Ontology-based Workflow Designer and Scheduler described in [12] and sketched in the following. An important component of MS-Analyzer, also developed at University Magna Græcia of Catanzaro, is SpecDB, a specialized spectra database used to manage and share experimental spectra data [13]. Finally, MS-Analyzer offers a large set of spectra-related services, such as spectra acquisition and conversion, spectra preprocessing for noise cleaning and data reduction, spectra preparation for data mining or statistical analysis, spectra mining (e.g. classification or clustering), and spectra and knowledge models visualization.

Fig. 1. Service Oriented Architecture of MS-Analyzer.
MS-Analyzer Architecture

MS-Analyzer (see Fig. 1) uses a Service Oriented Architecture (SOA) and provides a collection of specialized spectra management services, including spectra preprocessing, spectra analysis (e.g. data mining and visualization), and data movement services. The adoption of the SOA approach permits integration into MS-Analyzer of additional spectra management services (e.g. novel preprocessing tools) and sophisticated, third party analysis tools such as the BioDCV service. A central component of the system is the Ontology-based Workflow Designer and Scheduler that allows for easy retrieval, composition, and execution of such services. Spectra data, which undergo various transformations depending on the applied preprocessing services, are stored and managed in SpecDB. During workflow composition, ontologies are searched and browsed to guide the choice of the services and tasks most suitable for solving a problem.

The main functions offered by MS-Analyzer are the following:
• Interface to heterogeneous mass spectrometers, such as MALDI-TOF, SELDI-TOF, ICAT-based LC-MS/MS. Formats are unified into mzData, in compliance with the HUPO-PSI proteomics standardization initiative [9].
• Acquisition, storage, and management of MS data using the SpecDB database. Spectra are stored in their different stages (raw, pre-processed, prepared), using different representations (original, relational, or mzData).
• Preprocessing (smoothing, baseline subtraction, normalization, binning, peak alignment), preparation (for data mining), and analysis (mining and visualization) of MS data. Such functions are made available as services (see [4] for details on the algorithms); a minimal sketch of two of these steps follows this list.
• Sharing of experimental data, workflows and knowledge models. Spectra datasets can be retrieved by querying SpecDB, while executed workflows and discovered models are stored in XML files [13].
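As an illustration only, the following numpy sketch shows how two of the preprocessing steps named above (baseline subtraction and binning) can be realized; the function names, the moving-minimum baseline estimate and the default window sizes are our own illustrative choices, not the algorithms actually implemented by MS-Analyzer (for those, see [4]).

    import numpy as np

    def subtract_baseline(intensity, window=501):
        # crude baseline estimate: moving minimum over a wide window
        pad = window // 2
        padded = np.pad(intensity, pad, mode="edge")
        baseline = np.array([padded[i:i + window].min()
                             for i in range(intensity.size)])
        return intensity - baseline

    def bin_spectrum(mz, intensity, bin_width=1.0):
        # sum intensities falling into fixed-width m/z bins
        edges = np.arange(mz.min(), mz.max() + bin_width, bin_width)
        idx = np.digitize(mz, edges)
        binned = np.zeros(edges.size)
        np.add.at(binned, idx - 1, intensity)
        return edges, binned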
The SpecDB Spectra Database

The SpecDB spectra database [13] is the data layer of MS-Analyzer (see bottom part of Fig. 1). To face the huge volumes of mass spectra data that cannot be analyzed using the main memory alone, a hybrid XML-relational database for spectra data was developed. It provides easy, efficient access to a single spectrum, to multiple spectra, and to relevant portions of spectra. Spectra may be stored using three different formats:
• raw (original) files. Original data produced by the MS instrument is saved unchanged on the file system. This data is indexed and referenced to the database, allowing users to retrieve the original dataset;
• tuples in a relational database. An open-source database instance is used to store spectra as couples (intensity, m/z) adorned with meta-information about instrument, experiment, results and clinical/biological annotations;
• mzData XML-based instances [9]. Meta-information is stored as XML element instances, where spectra measurements are compressed into an XML element, using a simple coding compression formalism. The mzData representation is very useful for sharing whole spectra among nodes. Moreover, using XML, it is possible to define simple views on XML data, permitting fine-tuned, personalized access to such data. mzData instances are stored in a native XML database.

SpecDB is used to store spectra datasets in their different stages (raw, preprocessed, prepared), keeping track of the different phases of proteomics experiments.
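The following sqlite3 sketch illustrates the hybrid design just described: (m/z, intensity) couples as relational tuples with a metadata table, plus an mzData-like XML export in which the measurements are packed into a single base64-coded element. The schema and element names are our own simplifications; SpecDB's actual schema is described in [13].

    import base64
    import sqlite3
    import struct
    import xml.etree.ElementTree as ET

    conn = sqlite3.connect("specdb_sketch.db")
    conn.execute("CREATE TABLE IF NOT EXISTS spectrum "
                 "(spectrum_id INTEGER, mz REAL, intensity REAL)")
    conn.execute("CREATE TABLE IF NOT EXISTS spectrum_meta "
                 "(spectrum_id INTEGER PRIMARY KEY, instrument TEXT, "
                 "experiment TEXT, clinical_class TEXT)")

    def store(spectrum_id, pairs, instrument, experiment, clinical_class):
        # relational representation: one (mz, intensity) couple per tuple
        conn.executemany("INSERT INTO spectrum VALUES (?, ?, ?)",
                         [(spectrum_id, m, i) for m, i in pairs])
        conn.execute("INSERT INTO spectrum_meta VALUES (?, ?, ?, ?)",
                     (spectrum_id, instrument, experiment, clinical_class))
        conn.commit()

    def to_mzdata_like(spectrum_id):
        # mzData-style export: metadata as XML elements, measurements
        # compressed into a single base64-coded binary element
        rows = conn.execute("SELECT mz, intensity FROM spectrum "
                            "WHERE spectrum_id = ?",
                            (spectrum_id,)).fetchall()
        root = ET.Element("spectrum", id=str(spectrum_id))
        data = ET.SubElement(root, "mzArrayBinary")
        packed = struct.pack("<%dd" % (2 * len(rows)),
                             *[v for row in rows for v in row])
        data.text = base64.b64encode(packed).decode("ascii")
        return ET.tostring(root)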
Usually, a raw spectra dataset is stored in the original format on the file system for the first archiving; then the preprocessed dataset can be stored in both the relational and mzData formats. The former is used to prepare data for further data mining, while the latter is used for sharing spectra datasets with other laboratories [13]. Access to spectra data is provided through a Query Service that supports both SQL-like queries on relational data and XPath queries on the mzData schema. The former are used in retrieving portions of spectra; the latter are useful for retrieving entire spectra datasets on the basis of mzData metadata, but without inspecting peak values (stored in compressed XML elements).
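Reusing the sqlite3 sketch above, the two query styles can be rendered as follows; the actual Query Service interface is not reproduced here, only the idea.

    import xml.etree.ElementTree as ET

    # SQL-like query on the relational store: retrieve a portion of
    # one spectrum without loading the whole dataset into main memory
    portion = conn.execute(
        "SELECT mz, intensity FROM spectrum "
        "WHERE spectrum_id = ? AND mz BETWEEN ? AND ?",
        (1, 9100.0, 9200.0)).fetchall()

    # XPath-style query on mzData-like documents: select whole spectra
    # by their metadata, without decoding the compressed peak arrays
    dataset = ET.Element("spectrumList")
    dataset.append(ET.fromstring(to_mzdata_like(1)))
    hits = dataset.findall("./spectrum[@id='1']")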
In summary, SpecDB offers the following spectra data management functions that are used by MS-Analyzer services:
• efficient storage and retrieval of data (single spectrum, set of spectra and portions of spectra);
• import/export functions (e.g. loading of raw spectra and exporting of spectra in mzData format);
• query/update functions able to enhance the performance of data preprocessing and analysis (e.g. avoiding full main-memory processing); for instance, stored SQL procedures could be used to implement some preprocessing techniques (e.g. binning), as sketched after this list;
• retrieval through XML-based querying of spectra datasets shared on the Grid.
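A minimal example of pushing binning into the database engine, in the spirit of the third point: intensities are summed into fixed-width m/z bins by a single set-based SQL statement, so the spectrum never has to be fully materialized in main memory. SpecDB's actual stored procedures are not shown in this paper; the statement below is only an illustration on the sqlite3 sketch introduced earlier.

    BIN_WIDTH = 1.0
    binned = conn.execute(
        "SELECT CAST(mz / ? AS INTEGER) AS bin, SUM(intensity) "
        "FROM spectrum WHERE spectrum_id = ? "
        "GROUP BY bin ORDER BY bin",
        (BIN_WIDTH, 1)).fetchall()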
Ontology-Based Workflow Designer and Scheduler

Fig. 2 shows a snapshot of the Ontology-based Workflow Designer and Scheduler: the right panel shows a fragment of the MS-Analyzer ontology that models software tools, the left panel shows the available spectra datasets, while in the central panel the user can design the workflow by drag&drop of data sources and tools. The editor produces an abstract graphical workflow schema that is translated into an internal representation (a directed graph) and into the BPEL (Business Process Execution Language for Web Services) workflow language.

The BPEL workflow language was initially chosen since it is an emerging standard for web services workflows and since several BPEL schedulers are available (e.g. from IBM and Oracle). The first experience using MS-Analyzer showed us that not all BPEL functions are important for the typical spectra analysis workflow; a different workflow language more suitable for bioinformatics workflows, such as Scufl [14], is currently being considered.

The Ontology-based Workflow Designer and Scheduler comprises the following components (see top part of Fig. 1):
• The Ontology-Based Assistant is a wizard that suggests the tools available for a given bioinformatics problem or describes the characteristics of experimental spectra. It also provides information about the input and output allowed for each tool. This information is used to check the consistency of the designed workflow.
• The Workflow Editor, using the functions of the Ontology-Based Assistant, allows users to specify and design applications as workflows. Currently, they are designed by using a graphic notation based on UML (Unified Modelling Language).
• The Workflow Scheduler schedules and controls the execution of activities. To optimize application execution or to satisfy constraints, the Workflow Scheduler may move data by calling Grid functions, such as the Globus GridFTP [15]. While the Ontology-based Workflow Editor has been fully implemented, scheduling activities are currently managed by a rough internal scheduler that analyzes the workflow schema and activates the tasks of the workflow (represented internally as a directed graph) in a sequential way and in a centralized manner. It invokes the web services, sending them the proper SOAP messages, then collects the results (which are stored in SpecDB) that in turn are used to compose the next SOAP message for the next service (a minimal sketch of this sequential loop follows the list). In the experiments described in Section IV, the normalization, binning and preparation services are invoked sequentially: raw spectra are first loaded and normalized, then the resulting data are binned, and finally the binned data are transferred to BioDCV. Grid middleware functions, such as the Globus GridFTP [15], can be used to transfer datasets when non-local services are involved.
• The Workflow Metadata Repository contains all the information on workflow schema and execution, including data about control flow (WF Schema Metadata) and data used to perform a task (WF Application Metadata).
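The sequential strategy of the internal scheduler can be summarized by the following sketch; invoke_soap() and specdb_store() are placeholders for the SOAP invocation and the SpecDB storage steps, not the actual MS-Analyzer API.

    def invoke_soap(service_url, operation, argument):
        # placeholder: a real client POSTs a SOAP envelope built from
        # the previous result to service_url and parses the response
        raise NotImplementedError

    def specdb_store(result_ref):
        # placeholder: persist the intermediate result into SpecDB
        pass

    def run_workflow(tasks, dataset_ref):
        # tasks: (service_url, operation) pairs, already sorted in the
        # dependency order of the workflow graph, e.g.
        # [(u1, "normalize"), (u2, "bin"), (u3, "prepare")]
        result_ref = dataset_ref
        for service_url, operation in tasks:
            result_ref = invoke_soap(service_url, operation, result_ref)
            specdb_store(result_ref)   # keep every stage queryable
        return result_ref              # final reference handed to BioDCV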
Fig. 2. The Ontology-based Workflow Designer and Scheduler.

B. BioDCV

The predictive modeling portion of the proposed system is provided by BioDCV, the FBK-irst platform for machine learning in high-throughput functional genomics. BioDCV fully supports complete validation in order to control selection bias effects. To harness the intensive data throughput, BioDCV uses E-RFE, an entropy-based acceleration of the SVM-RFE feature-ranking procedure [16]. For proteomics, it includes preprocessing methods adapted from existing R packages and concatenated to the complete validation system.
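For readers unfamiliar with the underlying ranking scheme, here is a plain SVM-RFE sketch (using scikit-learn for illustration): features are repeatedly eliminated starting from those with the smallest squared weight in a linear SVM. E-RFE accelerates exactly this loop by discarding whole blocks of low-weight features at once, based on the entropy of the weight distribution [16]; that variant is not reproduced here.

    import numpy as np
    from sklearn.svm import SVC

    def svm_rfe(X, y, n_drop=1):
        active = list(range(X.shape[1]))
        ranking = []                    # filled from least important up
        while active:
            clf = SVC(kernel="linear").fit(X[:, active], y)
            weights = (clf.coef_ ** 2).sum(axis=0)  # per-feature weight
            order = np.argsort(weights)             # weakest first
            drop = [active[i] for i in order[:n_drop]]
            ranking = drop + ranking    # later drops rank higher
            active = [f for f in active if f not in drop]
        return ranking                  # most important feature first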
Fig. 3. The architecture of the proposed environment: the MS-Analyzer Ontology-based Workflow Designer, the BioDCV-WS front-end (a DMZ server running Apache, mod_python and the ZSI module), the GridFTP data/metadata repositories, and the two computing back-ends (EGEE Biomed VO and local cluster facility); BioDCV outputs include visualization of ATE, sample tracking, HTML publication and e-mail notification. See text for details.
BioDCV has been a grid application since 2005. Initially designed for local clusters, BioDCV was grid-enabled so that it could run on the EGRID VO within the Italian INFN infrastructure. It has accounted for 193 CPU-days in benchmark analyses. Investigations found a linear speed-up with respect to the footprint, defined as the product of the number of samples and the number of features. Since March 2006, BioDCV has been running as an external application in the EGEE Biomed VO, the virtual organization for the biomedical domain of the EGEE project. The EGEE Grid consists of over 20,000 CPUs available to users 24 hours a day, 7 days a week, in addition to about 5 Petabytes of storage. On average, 20,000 concurrent jobs are handled per day.
As a grid application, BioDCV can be executed on the LCG2/gLite middleware, as well as on MPI grid sites. Input/output are managed by SQLite databases stored as flat files on grid Storage Elements. A key aspect is that the solution does not require any external library to be additionally installed on the grid elements. BioDCV was used in production for proteomics tasks within the EGEE Biomed VO with a failure rate close to 2% [7] at first submission, while all resubmitted experiments completed successfully. The failure rate is defined as the percentage of failing subprograms among those into which the original complete BioDCV experiment is split.
C. Combining by web services

Internet web services are used to remotely integrate the main components of the proposed environment. A web service is a software system designed to support interoperable machine-to-machine interaction over a network. This definition encompasses many different specifications; the standard one is based on SOAP (Simple Object Access Protocol), using messages formatted in XML and sent over the HTTP protocol.

The BioDCV component is invoked by the MS-Analyzer workflow through a web service called BioDCV-WS. This service consists of a Python script that loads the Zolera SOAP Infrastructure (ZSI) module and that runs on an Apache server. The resulting architecture, displayed in Fig. 3, is described below. The BioDCV-WS service can be called after MS-Analyzer has completed the raw spectra preprocessing and the results have been stored on a GridFTP server. The service invocation must specify three arguments: (i) the path on the GridFTP server where MS-Analyzer has stored its results, (ii) an e-mail address for communicating the status of the BioDCV task execution and the publication location of the BioDCV outputs, and (iii) whether the local cluster facility or the EGEE Biomed VO facility is to be used for computing. At this point, BioDCV-WS retrieves the data and starts the BioDCV experiment. The preprocessing phase is first run by means of an R script (developed by the authors and described in [7]); preprocessed spectra and experimental design are then stored in a SQLite database. After this step, BioDCV-WS uploads the data and starts one or more jobs on the selected computing resource.
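At the wire level, such an invocation boils down to a SOAP envelope carrying the three arguments, POSTed over HTTP. The sketch below renders this with the Python standard library; the endpoint URL, namespace, operation and element names are hypothetical (the real service is described by its WSDL and was accessed through ZSI and Java clients).

    import urllib.request

    ENDPOINT = "https://biodcv-ws.example.org/BioDCV"   # hypothetical

    def start_biodcv(gridftp_path, email, facility):
        # facility: "local-cluster" or "egee-biomed" (illustrative)
        envelope = """<?xml version="1.0"?>
    <soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
      <soap:Body>
        <startExperiment xmlns="urn:biodcv-ws">
          <dataPath>%s</dataPath>
          <notifyEmail>%s</notifyEmail>
          <facility>%s</facility>
        </startExperiment>
      </soap:Body>
    </soap:Envelope>""" % (gridftp_path, email, facility)
        req = urllib.request.Request(
            ENDPOINT, data=envelope.encode("utf-8"),
            headers={"Content-Type": "text/xml; charset=utf-8",
                     "SOAPAction": "urn:biodcv-ws#startExperiment"})
        return urllib.request.urlopen(req).read()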
First, here is what happens when a grid facility is involved. The BioDCV-WS script interacts with a Linux daemon service (BioDCV-WS-UI, also a Python program) running on a LCG2/gLite user interface to submit and control jobs on the EGEE grid infrastructure. In particular, this implementation is designed for the current Biomed VO of EGEE. When BioDCV-WS launches a complete validation profiling on the grid, the daemon service BioDCV-WS-UI on the EGEE user interface first copies the SQLite database to a Storage Element (SE, the data server machine) and then submits the BioDCV complete validation procedure as parallel jobs (through a JDL - Job Description Language - file for each job) on the Biomed VO grid sites. The BioDCV-WS-UI script checks the status of the parallel jobs and, when they are completed, collects their outputs and unifies them, adding the result to the original SQLite database. File transfers within the grid are performed by LFC (LCG File Catalog) commands, always through the user interface commands invoked by BioDCV-WS-UI.
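A per-job submission on the user interface might look as follows. The JDL attributes shown are standard ones, but the wrapper script name, the file layout and the exact submission command are assumptions (the command was edg-job-submit on LCG2 and glite-wms-job-submit on later gLite releases).

    import subprocess

    JDL_TEMPLATE = """\
    Executable    = "run_biodcv.sh";
    Arguments     = "%(run_id)d";
    StdOutput     = "biodcv_%(run_id)d.out";
    StdError      = "biodcv_%(run_id)d.err";
    InputSandbox  = {"run_biodcv.sh"};
    OutputSandbox = {"biodcv_%(run_id)d.out", "biodcv_%(run_id)d.err"};
    VirtualOrganisation = "biomed";
    """

    def submit(run_id):
        # write one JDL file per job, then hand it to the resource broker
        jdl = "biodcv_%d.jdl" % run_id
        with open(jdl, "w") as f:
            f.write(JDL_TEMPLATE % {"run_id": run_id})
        subprocess.check_call(["edg-job-submit", "--vo", "biomed", jdl])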
A different Python script (BioDCV-WS-Local-Cluster) is used by the web service BioDCV-WS to execute the R script and to launch the complete validation phase on the local cluster facility. In both cases, the BioDCV results are eventually published as HTML pages on the Apache server where the web service is running, and the MS-Analyzer user is notified by e-mail when the results are complete.

Regarding network security, we remark that only BioDCV-WS runs in the DMZ area, while the EGEE (BioDCV-WS-UI) and the cluster (BioDCV-WS-Local-Cluster) interfaces run behind the firewall without a direct connection to the external users. The possibility of running BioDCV-WS as a Python script on an Apache server using mod_python allows our web service to use security procedures such as password, SSL, X509 certificate authentication and LDAP authorization (all supported by Apache). In our case, the GridFTP data server checks the X509 user certificate specifically in the EGEE user interface in order to guarantee a secure system for data access.

Python was chosen as the script language for code deployment because it is a dynamic object-oriented programming language and because it offers strong support for integration with other languages and tools, including the ZSI library on which our solution is based. The BioDCV-WS web service was tested with alternative client programs in Python and in Java without encountering any difficulty. In fact, the use of the ZSI module guarantees compliance with the WSDL (Web Services Description Language) and thus the support of Python, Java and any other WSDL-compliant solutions.

TABLE I
Proteomic profiling on Ovarian Cancer data. The 20 best peaks according to the complete validation procedure are listed in descending order of importance of marker within the top 20 best markers in 100 runs (Exts: number of extractions; mean and standard deviation (SD) of rank are also listed).

 #   Peaks id   Exts   Mean    SD
 1      29      100     5.3    2
 2      49      100     6.3    2.6
 3      34       99     2.4    2.4
 4      46       99     6.9    4.3
 5      42       98     5      4.5
 6      55       97     9.8    4.5
 7      32       93    11.9    3.1
 8      51       91     6.2    4.9
 9      41       89     9.8    5.4
10      40       87     8.6    6.5
11      33       87     9.9    3.7
12      35       84    13.4    3.7
13      59       83    14.7    3.1
14      50       79    13.3    3.3
15      43       74     6.6    4.4
16      37       65    12.4    4.6
17      28       64    10.9    5.3
18      30       59    14.4    3.7
19      53       54    14.2    3.3
20      52       53    13.2    4
In this solution, the reduction of execution time is not a major reason for using the Grid. This architecture allows running a complex clinical proteomics experiment without the availability of a local cluster. Moreover, the design can support multicenter studies with different proteomics laboratories sharing spectra and clinical information. Such huge datasets require the high-performance data transfer functions of the Grid, such as the Globus GridFTP or the Replica Manager.

III. DATA DESCRIPTION

To test the described environment, a MALDI-TOF Ovarian Cancer dataset (available with the Rdist package [17] from http://www-stat.stanford.edu) was processed for binary classification and profiling. Although the dataset is of relatively low clinical interest, it is considered a valid benchmark for algorithm testing. The dataset includes 49 samples (24 cancer and 25 controls), described by vectors of 56384 values of mass-to-charge measurements, for a total file size of 892 Kb. The raw data were preprocessed according to the procedure described in [7]. After baseline subtraction, AUC-normalization, centroid identification by clustering and peak alignment, 67 centroids were identified, obtaining the representation used in our experiments on both the grid and the local facilities.

A second dataset, UniCZ1, was also used. It consists of proprietary data produced by the MALDI-TOF facility at University Magna Græcia of Catanzaro for lab calibration purposes. The biomaterial consisted of 20 human serum samples, 10 with two additional proteins and 10 used as controls. The spectra were obtained after 20 technical replicates, with 34671 measurements of mass-to-charge ratios.

IV. RESULTS

A suite of classification and profiling experiments was performed on the Ovarian Cancer and the UniCZ1 datasets with the use of all the components in the pipeline.

After the preprocessing on MS-Analyzer and the transfer of data, a set of 100 BioDCV runs was submitted to the grid infrastructure; the procedure was repeated 10 times, for a total of 20 JDL files submitted through the Biomed VO resource broker to 7 grid sites. All jobs came back successfully. The system is, in this case, providing a ranked list of potential biomarkers: a selected peak might define an isotopic family, or just be a target for further investigation. As an example, the 20 best ranked peaks resulting from the first experiment are described in Tab. I, while the estimated predictive error curve for the experiment is displayed in Fig. 4. The average run time was 677 seconds; each job was run on two worker nodes (for each submission): all of the resulting SQLite databases produced as output coincided across the 10 experiments. The same task was also scheduled on the local cluster mimicking the grid worker nodes behavior (each job running on only 2 CPUs), resulting in an average run time of 311 seconds. For comparison purposes, the same task was performed on a bi-processor workstation with 2 Intel Pentium IV 3.2 GHz and 2 Gb RAM, with a total run time of 466 seconds. Again, profiling results were very close to the grid experiments. The better performance achieved by the local facilities (cluster and workstation) can be explained by taking into account two main
facts: the small dimension of the involved task, and the higher CPU and RAM characteristics of the local resources with regard to the average worker node on which the jobs run. The use of the grid is crucial in the case of datasets with a larger footprint. With small footprints, the cost of the pure machine learning part may be comparable to the costs of the grid procedures. On large datasets, the effects of distributing the computation are much more relevant, while the housekeeping costs increase much less.
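As a back-of-envelope illustration of the footprint argument (the numbers are those reported in Section III; the cost model itself is not quantified in this paper):

    # footprint = number of samples x number of features
    datasets = {
        "Ovarian Cancer (raw)":       (49, 56384),
        "Ovarian Cancer (centroids)": (49, 67),
        "UniCZ1 (raw)":               (20, 34671),
        "UniCZ1 (peaks)":             (20, 347),
    }
    for name, (n_samples, n_features) in datasets.items():
        print("%-28s footprint = %d" % (name, n_samples * n_features))

After centroid or peak extraction, both experiments have footprints of only a few thousand, which is consistent with the local facilities outperforming the grid on these tasks.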
Fig. 4. Example of profiling results (Ovarian Cancer dataset): Average Test Error rate (ATE, in %) computed on the test portion of the dataset over 100 runs, for SVM models with an increasing number of features (n = 1 to 67).

Fig. 5. Class A and B average spectra at the most discriminative peak (9133.17 Da; UniCZ1 dataset). Dotted curves indicate 95% Student bootstrap confidence intervals.

A complete process from bench to profiling has also been tested on the proprietary dataset UniCZ1. The number of potential markers was reduced to 347 peaks by the preprocessing phase. In this case, the analysis was used to assess the calibration procedure. The first discriminant peak is displayed in Fig. 5. Note that for quantitative input data (e.g. ICAT-based MS/MS data or expression levels of proteins in different samples), the system can directly provide discriminant proteins.

V. CONCLUSIONS

In the set-up described in this paper, BioDCV is accessed from the MS-Analyzer workflow as a service, thus providing a complete pipeline for proteomics data analysis. This grid environment takes care of avoiding the selection bias effect in proteomics studies. At the same time, the use of workflows and ontologies for spectra management provides a user-friendly interface for application modeling and design. The use of the service-oriented architecture makes the system open and easily extensible through new proteomics tools provided as services by third parties. At the moment, a salient property of the grid application is that the low job-failure rate of BioDCV is also confirmed within the new environment.

ACKNOWLEDGMENTS

We thank D. Albanese for his support in the development of the BioDCV system and B. Irler for help in proteomics data preprocessing. We acknowledge the initial support of the EGRID project, and S. Cozzini, R. Barbera and M. Mazzuccato for the gridification of BioDCV. We also thank A. Gallo for the integration of MS-Analyzer with BioDCV, G. Cuda and M. Gaspari for providing the proteomics facilities, and P.H. Guzzi and T. Mazza for their contribution in developing the first version of MS-Analyzer.

REFERENCES

[1] C. Germain, V. Breton, P. Clarysse, Y. Gaudeau, T. Glatard, E. Jeannot, Y. Legré, C. Loomis, I. Magnin, J. Montagnat, J.-M. Moureaux, A. Osorio, X. Pennec, and R. Texier, "Grid-enabling medical image analysis," Journal of Clinical Monitoring and Computing, vol. 19, no. 4-5, pp. 339-349, 2005.
[2] T. Glatard, J. Montagnat, and X. Pennec, "Efficient services composition for grid-enabled data-intensive applications," in IEEE HPDC 2006, Paris, France, 19-23 June 2006.
[3] K. Baggerly, J. Morris, and K. Coombes, "Reproducibility of SELDI-TOF protein patterns in serum: comparing datasets from different experiments," Bioinformatics, vol. 20, no. 5, pp. 777-785, 2004.
[4] M. Cannataro, P. Guzzi, T. Mazza, G. Tradigo, and P. Veltri, "Preprocessing of mass spectrometry proteomics data on the Grid," in IEEE CBMS 2005, Dublin, Ireland, 23-24 June 2005, pp. 549-554.
[5] D. Ransohoff, "Lessons from controversy: ovarian cancer screening and serum proteomics," Journal of the National Cancer Institute, vol. 97, no. 4, pp. 315-319, 2005.
[6] A. Molinaro, R. Simon, and R. Pfeiffer, "Prediction error estimation: a comparison of resampling methods," Bioinformatics, vol. 21, no. 15, pp. 3301-3307, 2005.
[7] A. Barla, B. Irler, S. Merler, G. Jurman, S. Paoli, and C. Furlanello, "Proteome profiling without selection bias," in IEEE CBMS 2006, Salt Lake City, US, 22-23 June 2006, pp. 941-946.
[8] M. Cannataro and P. Veltri, "MS-Analyzer: composing and executing preprocessing and data mining services for proteomics applications on the Grid," Concurrency and Computation: Practice and Experience, Wiley, published online 19 Dec 2006, in press.
[9] S. Orchard, H. Hermjakob, P. Binz, C. Hoogland, C. Taylor, W. Zhu, R. Julian Jr., and R. Apweiler, "Further steps towards data standardisation: the Proteomic Standards Initiative HUPO 3rd annual congress, Beijing 25-27th October, 2004," Proteomics, vol. 5, no. 2, pp. 337-339, 2005.
[10] MSAnalyzer, Seattle Proteome Center (SPC) - Proteomics Tools, NHLBI Proteomics Center at the Institute for Systems Biology, 2007. [Online]. Available: http://tools.proteomecenter.org/MSAnalyzer.php
[11] M. Cannataro, P. Guzzi, T. Mazza, G. Tradigo, and P. Veltri, "Using ontologies for preprocessing and mining spectra data on the Grid," Future Generation Computer Systems, vol. 23, no. 1, pp. 55-60, 2007.
[12] M. Cannataro, P. Guzzi, T. Mazza, G. Tradigo, and P. Veltri, "Managing ontologies for Grid computing," Multiagent and Grid Systems, vol. 2, no. 1, pp. 29-44, 2006.
[13] M. Cannataro, G. Tradigo, and P. Veltri, "Sharing mass spectrometry data in a Grid-based distributed proteomics laboratory," Information Processing & Management, vol. 43, no. 3, pp. 577-591, 2007.
[14] R. Stevens, A. Robinson, and C. Goble, "myGrid: personalised bioinformatics on the information grid," Bioinformatics, vol. 19, suppl. 1, pp. i302-i304, 2003.
[15] I. Foster and C. Kesselman, "Globus Toolkit version 4: software for service-oriented systems," in IFIP International Conference on Network and Parallel Computing, Springer-Verlag LNCS 3779, 2005, pp. 2-13.
[16] C. Furlanello, M. Serafini, S. Merler, and G. Jurman, "Semisupervised learning for molecular profiling," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 2, no. 2, pp. 110-118, 2005.
[17] R. Tibshirani, T. Hastie, B. Narasimhan, S. Soltys, G. Shi, A. Koong, and Q. Le, "Sample classification from protein mass spectrometry, by peak probability contrasts," Bioinformatics, vol. 20, no. 17, pp. 3034-3044, 2004.