Representing Microarray Experiment Metadata Using Provenance Models
James McCusker
February 9, 2010
Abstract
MAGE (MicroArray and Gene Expression) representations are primarily representations of workflow: a process was used to derive biomaterial A from biomaterial B. This structure is well suited to expression in provenance models such as OPM (Open Provenance Model) and PML (Proof Markup Language). We demonstrate methods and tools, MAGE2OPM and MAGE2PML, that convert RDF representations of MAGE graphs to OPM and PML respectively. We evaluate the representations' expressiveness, visualization, and integration with existing data. We argue that provenance models are sufficient to represent biomedical experimental metadata in general, and may provide a useful point of reference for unifying information from multiple systems, including biospecimen management, Laboratory Information Management Systems (LIMS), and computational workflow automation tools. This unification results in a complete, computationally useful derivational picture of biomedical experimental data.
1 Introduction
Provenance is critical to successful scientific research. Systems that support the sciences therefore need a framework for incorporating provenance information at every step of the research chain. Further, the context provided by richly encoded provenance can be used to automate certain aspects of scientific research. For instance, if a system could tell that the samples used in two different experiments were derived from the same source (such as a tumor, blood sample, or patient), it could potentially integrate those datasets automatically at the level of specificity the researcher needs. Converting current representations of experimental metadata (the protocols used, the experimental conditions, which samples were derived in what ways, and so on) into a unified model of provenance makes it possible to support automated data integration and experimental workflow analysis. A first step in this process is to convert experimental data into a common provenance representation.
MAGE (MicroArray and Gene Expression) representations are primarily representations of experimental workflow: a process was used to derive biomaterial A from biomaterial B. This structure is well suited to expression in provenance models such as OPM (Open Provenance Model) [3] and PML (Proof Markup Language) [1]. I argue that provenance models are sufficient to represent biomedical experimental metadata in general, and may provide a useful point of reference for unifying information from multiple systems, including biospecimen management, Laboratory Information Management Systems (LIMS), and computational workflow automation tools. This unification results in a complete, computationally useful derivational picture of biomedical experimental data.
2 Related Work
• Taverna and WINGS provenance recording
• OPM
• PML
• MGED and MAGE
3 Methods
MAGE is most commonly represented using the MAGE-TAB format [4], a spreadsheet format that describes microarray and gene expression experiments. Using LIMPOPO [5], an API for parsing MAGE-TAB files, I created MAGETAB2MAGERDF, a tool that produces RDF representations of the MAGE object model [6]. I convert MAGE-RDF to OPM using declarative rules encoded in SWRL, and I have mapped MAGE-RDF to PML using the same rule-based strategy.
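As a rough illustration of the first step, the sketch below reads a few SDRF-style rows (the sample and data relationship part of MAGE-TAB) and emits subject-predicate-object triples, one protocol application per derivation step. It is not the MAGETAB2MAGERDF implementation and does not use LIMPOPO; the `ex:` names, the `derivation_triples` helper, and the sample rows are all assumptions made for the example.

```python
import csv
import io

# Hypothetical SDRF fragment: each row traces one source through to a hybridization.
SDRF = (
    "Source Name\tProtocol REF\tExtract Name\tProtocol REF\tHybridization Name\n"
    "tumor1\tP-extract\textract1\tP-hyb\thyb1\n"
    "tumor2\tP-extract\textract2\tP-hyb\thyb2\n"
)

def derivation_triples(sdrf_text):
    """Turn each 'node, Protocol REF, node' run of columns into triples."""
    rows = list(csv.reader(io.StringIO(sdrf_text), delimiter="\t"))
    header, triples = rows[0], []
    for row_number, row in enumerate(rows[1:], start=1):
        previous, protocol, step = None, None, 0
        for column, value in zip(header, row):
            if column == "Protocol REF":
                protocol = value  # remember the protocol for the next node
                continue
            node = "ex:" + value
            if previous is not None and protocol is not None:
                step += 1
                # Hypothetical identifier for this application of the protocol.
                application = f"ex:application_{row_number}_{step}"
                triples.append((application, "ex:usesProtocol", "ex:" + protocol))
                triples.append((application, "ex:hasInput", previous))
                triples.append((application, "ex:hasOutput", node))
            previous, protocol = node, None
    return triples

triples = derivation_triples(SDRF)
```

Each row yields two protocol applications here, so the "process derives B from A" structure is already explicit before any provenance rules are applied.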
3.1 MAGE-RDF to Open Provenance Model
• Subclasses
    opm:Account
        ∗ mged:Experiment
    opm:Agent & subproperties
        ∗ mged:Contact
    opm:Artifact
        ∗ mage:ArrayHybridization
        ∗ mage:ParameterValue
        ∗ mged:ArrayGroup
        ∗ mged:BioMaterial
        ∗ mged:ExperimentalDesign
        ∗ Observation
        ∗ mged:Protocol
        ∗ mged:BioAssayData
    opm:Process
        ∗ mged:BioAssay
    opm:Role
        ∗ MGEDArtifactRole
        ∗ mged:Parameter
        ∗ mged:Roles
• 15 SWRL rules (using swrlx:makeOWLThing(); currently requires Jess + Pellet)
• Roles mapped through restrictions on original classes.
3.2 MAGE-RDF to Proof Markup Language
• Subclasses of pmlp:Document, pmlp:Information, pmlj:NodeSet, and pmlp:DeclarativeRule
    ∗ mged:BioAssayDataPackage
    ∗ mged:BioMaterial
    ∗ mged:FactorValue
    ∗ mged:Protocol
    ∗ mage:ArrayHybridization
    ∗ mged:Experiment
    ∗ mage:ProtocolApplication
• 5 SWRL rules (using swrlx:makeOWLThing(); currently requires Jess + Pellet)
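The rule-based mapping can be pictured with a small plain-Python stand-in, shown here for the OPM case. This is not the SWRL rule set itself: the three class pairs are an illustrative subset chosen for the sketch, the `(process, input, output)` step format is hypothetical, and the `opm:`-prefixed edge names are written for readability rather than taken from the OPM OWL vocabulary. The sketch shows two ideas: an individual inherits a provenance class from its MAGE class, and one derivation step expands into used, wasGeneratedBy, and wasDerivedFrom edges.

```python
# Illustrative subset of a class mapping; the pairs are assumptions for this sketch.
SUBCLASS_OF = {
    "mged:BioMaterial": "opm:Artifact",
    "mged:BioAssay": "opm:Process",
    "mged:Contact": "opm:Agent",
}

def infer_provenance_types(types):
    """Add the provenance class implied by each individual's MAGE class."""
    inferred = {}
    for individual, mage_class in types.items():
        classes = {mage_class}
        if mage_class in SUBCLASS_OF:
            classes.add(SUBCLASS_OF[mage_class])
        inferred[individual] = classes
    return inferred

def derivation_edges(steps):
    """For each (process, input, output) step, emit OPM-style edges."""
    edges = set()
    for process, source, product in steps:
        edges.add((process, "opm:used", source))
        edges.add((product, "opm:wasGeneratedBy", process))
        edges.add((product, "opm:wasDerivedFrom", source))
    return edges

types = {"ex:tumor1": "mged:BioMaterial", "ex:hyb1": "mged:BioAssay"}
inferred = infer_provenance_types(types)
edges = derivation_edges([("ex:hyb1", "ex:extract1", "ex:data1")])
```

A PML mapping would follow the same pattern with a different target vocabulary. Rules that must create new individuals (the role of swrlx:makeOWLThing() above) are not modeled in this sketch.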
3.3 Evaluation
Expressiveness & Coverage: Evaluate the number of individuals that are connected to the provenance graph through properties that are either part of the provenance model or subproperties of its properties.
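One way to make this measure concrete is as a ratio, sketched below under stated assumptions: the graph is a list of triples, the caller has already expanded subproperties into the `provenance_properties` set, and the `coverage` helper and sample data are hypothetical.

```python
def coverage(individuals, triples, provenance_properties):
    """Fraction of individuals touched by at least one provenance edge."""
    connected = set()
    for subject, predicate, obj in triples:
        if predicate in provenance_properties:
            connected.update((subject, obj))
    connected &= set(individuals)  # ignore nodes outside the individual set
    return len(connected) / len(individuals) if individuals else 0.0

individuals = {"ex:tumor1", "ex:extract1", "ex:hyb1", "ex:note1"}
triples = [
    ("ex:extract1", "opm:wasDerivedFrom", "ex:tumor1"),
    ("ex:hyb1", "opm:used", "ex:extract1"),
    ("ex:note1", "ex:comment", "ex:tumor1"),  # not a provenance property
]
score = coverage(individuals, triples, {"opm:wasDerivedFrom", "opm:used"})
```

Here three of the four individuals take part in a provenance edge, so the score is 0.75; `ex:note1` is linked only by a non-provenance property and does not count.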
Visualization: Display an example provenance graph using tools built specifically for each provenance model. Measure the information density of the visualization using IWBrowser and the OPM dot visualization tool.
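For orientation, the sketch below serializes a small edge set as Graphviz DOT text and computes one candidate density figure. It is not the OPM dot visualization tool, and the density definition (labeled edges per node drawn) is an assumption made for this example.

```python
def to_dot(edges):
    """Serialize (source, label, target) edges as a Graphviz digraph."""
    lines = ["digraph provenance {"]
    for source, label, target in sorted(edges):
        lines.append(f'  "{source}" -> "{target}" [label="{label}"];')
    lines.append("}")
    return "\n".join(lines)

edges = {
    ("ex:extract1", "opm:wasDerivedFrom", "ex:tumor1"),
    ("ex:hyb1", "opm:used", "ex:extract1"),
}
dot_text = to_dot(edges)

# Assumed density measure for this sketch: labeled edges per node drawn.
nodes = {node for source, _, target in edges for node in (source, target)}
density = len(edges) / len(nodes)
```

The resulting text can be passed to any DOT renderer. A different density measure would change only the last two lines.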
Data Integration: List and demonstrate additional provenance graphs that can be added to extend the existing provenance graph longitudinally.
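A minimal sketch of such an extension, assuming graphs are sets of triples that share identifiers for the same artifact: a hypothetical biospecimen graph is merged with two microarray graphs, and the derivation chain is then followed upstream to find the sources two datasets have in common. All identifiers and helper names are invented for the example.

```python
def merge_graphs(*graphs):
    """Union the edge sets; shared identifiers join the graphs."""
    merged = set()
    for graph in graphs:
        merged |= set(graph)
    return merged

def sources_of(node, edges):
    """All nodes reachable from `node` by following wasDerivedFrom edges."""
    parents = {}
    for child, predicate, parent in edges:
        if predicate == "opm:wasDerivedFrom":
            parents.setdefault(child, set()).add(parent)
    seen, stack = set(), [node]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

biobank = {("ex:tumor1", "opm:wasDerivedFrom", "ex:patient1")}
array_a = {("ex:data_a", "opm:wasDerivedFrom", "ex:tumor1")}
array_b = {("ex:data_b", "opm:wasDerivedFrom", "ex:tumor1")}

merged = merge_graphs(biobank, array_a, array_b)
shared = sources_of("ex:data_a", merged) & sources_of("ex:data_b", merged)
```

After the merge, both datasets trace back to the same tumor and patient, which is the kind of link that would let a system propose integrating them.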
4 Results and Discussion
4.1 Expressiveness and Coverage
4.2 Visualization Tools
4.3 Data Integration
5 Conclusion
The long-term goal is to create complete end-to-end representations of experiments sufficient to explore and query similarities between experiments, for example to search for experiments with similar designs and to integrate experimental data based on their use of related artifacts or processes. These use cases have been laid out in Moreau 2009 [7].
References
[1] D.L. McGuinness, L. Ding, P.P. da Silva, and C. Chang, PML 2: A modular explanation interlingua, Proceedings of AAAI, 2007.
[2] L. Moreau, The Foundations for Provenance on the Web, Foundations and Trends in Web Science, 2009.
[3] L. Moreau, B. Clifford, J. Freire, Y. Gil, P. Groth, J. Futrelle, N. Kwasnikowska, S. Miles, P. Missier, and J. Myers, The Open Provenance Model Core Specification (v1.1), Future Generation Computer Systems, 2009.
[4] T. Rayner, P. Rocca-Serra, P. Spellman, H. Causton, A. Farne, E. Holloway, R. Irizarry, J. Liu, D. Maier, M. Miller, K. Petersen, J. Quackenbush, G. Sherlock, C. Stoeckert, J. White, P. Whetzel, F. Wymore, H. Parkinson, U. Sarkans, C. Ball, and A. Brazma, A simple spreadsheet-based, MIAME-supportive format for microarray data: MAGE-TAB, BMC Bioinformatics, vol. 7, 2006, p. 489.
[5] LIMPOPO. http://sourceforge.net/projects/limpopo/
[6] MAGETAB2MAGERDF. http://code.google.com/p/magetab2rdf/
[7] S. Miles, P. Groth, M. Branco, and L. Moreau, The requirements of recording and using provenance in e-Science experiments, 2005.