a web semantic for sbml merge - ProQuest Search

5 downloads 693 Views 13MB Size Report
How to merge biological models deterministically to integrate the merging process in a WMS? ... when you call setUseValuesFromTriggerTime (boolean.
A WEB SEMANTIC FOR SBML MERGE

By

Mathialakan Thavappiragasam, B.S (Hons), University of Jaffna, 2005

A Thesis Submitted in Partial Fulfillment of The Requirement for the Degree of Master of Science

Department of Computer Science The University of South Dakota May, 2014

UMI Number: 1566784

All rights reserved INFORMATION TO ALL USERS The quality of this reproduction is dependent upon the quality of the copy submitted. In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if material had to be removed, a note will indicate the deletion.

UMI 1566784 Published by ProQuest LLC (2014). Copyright in the Dissertation held by the Author. Microform Edition © ProQuest LLC. All rights reserved. This work is protected against unauthorized copying under Title 17, United States Code

ProQuest LLC. 789 East Eisenhower Parkway P.O. Box 1346 Ann Arbor, MI 48106 - 1346

DEDICATION This thesis is dedicated to my beloved late mother, my god Pushpavathy.

ACKNOWLEDGEMENTS I take this opportunity to thank all who have helped in many ways during the course of this project. Most importantly, I would like to thank my advisor Prof. Etienne Gnimpieba for the patient guidance and motivation throughout the project. I am extremely lucky to have such an active mentor who cared with my work. I gratefully thank Prof. Doug Goodman for his words of advice and engorgement to me during the course of my tenure at USD as a graduate student. I wish to express my sincere thanks to Prof. Carol Lushbough who has been supporting me through extremely difficult times to ensure that this project culminate successfully. I also like to thank Prof. Paula Mabee for her valuable guidance regarding biology related studies. My special thanks go to Prof Richard Mc Bride for his advisable guidance and teaching. I was extraordinarily fortunate in having Dr S Mahesan as my teacher in University of Jaffna. I could never have done all of this without his prior teaching in computer science. Thank you. Furthermore, I acknowledged my buddies: Shelton J Pragash and Bill Conn for their valuable support throughout my stint at USD. I would like to thank my father, Thavappiragasam for his inseparable support and prayers. Words fail me to express my appreciation to my wife Kukapalini whose devotion, love and persistent confidence in me, has taken the load off my shoulder. I owe her for being unselfishly let her intelligence, passions, and ambitions collide with mine.

Finally, I would like to thank everybody who was important to the successful of my thesis, as well as expressing my apology that I could not mention personally one by one.

Contents 1.

2.

3.

Introduction ................................................................................................................. 1 1.1.

Context and motivation ........................................................................................ 1

1.2.

Existing work and limits ...................................................................................... 4

1.3.

Research goals and objectives .............................................................................. 5

1.4.

Literature review .................................................................................................. 6

a.

The System Biology Markup Language (SBML) ............................................ 6

b.

Biological model comparison ......................................................................... 11

c.

The semantic in biological modeling .............................................................. 14

1.5.

Methodology and algorithm ............................................................................... 15

a.

SBML merge principles.................................................................................. 18

b.

SBML merge implementation ........................................................................ 18

The SBMLChecker for reliability checking ............................................................. 21 2.1.

Principle ............................................................................................................. 22

2.2.

Design and algorithms for SBMLChecker ......................................................... 22

2.3.

Implementation................................................................................................... 24

2.4.

Validation ........................................................................................................... 25

2.5.

SBMLAnot for model annotation ...................................................................... 25

a.

Principle .......................................................................................................... 25

b.

Design and algorithm for SBMLAnot ............................................................ 25

c.

Implementation ............................................................................................... 27

d.

Validation ....................................................................................................... 27

Algorithm for BioML Similarity (ABioS) ................................................................ 28 i

4.

5.

3.1.

Principle ............................................................................................................. 28

3.2.

ABioS design...................................................................................................... 29

3.3.

Application of ABios on FOCM ........................................................................ 33

3.4.

ABioS algorithm validation on Yeast models .................................................... 36

The SBMLCompare for model comparison ............................................................. 39 4.1.

Principle ............................................................................................................. 40

4.2.

Design and algorithms for SBMLCompare ....................................................... 40

4.3.

Implementation................................................................................................... 41

4.4.

Application on yeast (e-coli) mathematical models comparison ....................... 42

SBMLMerge for the model merging ........................................................................ 46 5.1.

Principles ............................................................................................................ 46

5.2.

Design and algorithms for SBMLMerge............................................................ 46

5.3.

Implementation................................................................................................... 51

5.4.

Validation ........................................................................................................... 51

6.

Result and discussion ................................................................................................ 52

7.

Conclusion ................................................................................................................ 59

ii

List of tables Table 1 The existence of the list of components in each SBML level-version [38]–[40], [42], [73]. Y and N state existence and not existence respectively. ................................... 7 Table 2 Levenshtein ratio and edit distance variation for different names ....................... 13 Table 3 Distance and ratio between species using ABioS, Algo_1 [7] and Algo_2 [63] 33 Table 4 Comparison of similarity measurements between ABioS and Edit_dist on selected models ................................................................................................................. 35 Table 5 Number of compartment, species, and reactions in selected models................... 44

iii

List of figures Figure 1 Relationship among the all possible components in SBML model [38] .............. 9 Figure 2 SBML level 3 version1 component list and interactions with the compartment of SBML model ..................................................................................................................... 15 Figure 3 Workflow diagram for SBML merging toolkit .................................................. 17 Figure 4 Architecture of the Bioextract server (a), SBMLCompare on the iPlant Cyber infrastructure (b) ............................................................................................................... 20 Figure 5 Global mechanism for model quality checking .................................................. 22 Figure 6 Algorithm for semantic analysis ......................................................................... 22 Figure 7 SBMLChecker on BioExtract server for the reliability checking of biomodels 24 Figure 8 SBMLChecker on the web portal ....................................................................... 24 Figure 9 Global mechanism for model annotation ........................................................... 26 Figure 10 Algorithm for model annotation ....................................................................... 26 Figure 11 Annotate assistant on the web portal SW4SBMLm ......................................... 27 Figure 12 Action Flow diagram for ABioS ...................................................................... 29 Figure 13 Algorithm for finding distance between two names......................................... 30 Figure 14 Distance comparison between ABioS (orange) and existing algorithms (grey) on each biological species (Table 3). ................................................................................ 33 Figure 15 Comparison of ABioS, and existing algorithm on selected models using Hit Ratio (a) and Penalty (b) parameters ................................................................................ 36 Figure 16 Venn diagram that compares ABioS, synonym, and existing algorithm on the models iND750, iFF708.................................................................................................... 38

iv

Figure 17 Comparison of ABioS, and existing algorithm on the models iND750, iFF708 using Hit Ratio (a) and Penalty (b) parameters ................................................................. 38 Figure 18 Algorithm for comparing SBML models......................................................... 40 Figure 19 SBMLCompare as a module in a Workflow Management System and a web portal ................................................................................................................................. 41 Figure 20 SBMLCompare on BioExtract server .............................................................. 41 Figure 21 SBMLCompare on web portal SW4SBMLm .................................................. 42 Figure 22 Yeast selected models iND750 (a) iFF708 (b) for model similarity study. ...... 43 Figure 23 Reliability of SBMLCompare, and existing tool on selected models using Hit Ratio (a) and Penalty (b) parameters. ............................................................................... 44 Figure 24 Global mechanism for SBML model merging design. ..................................... 46 Figure 25 Conflict management in merging SBML models ............................................. 47 Figure 26 The complete flow diagram for merging of two models .................................. 48 Figure 27 The algorithm for merging a list of models ...................................................... 49 Figure 28 The algorithm for conflict removal and synchronization of models ................ 50 Figure 29 SBMLMerge on the web portal SW4SBMLm ................................................. 51 Figure 30 SBMLMerge on BioExtract server ................................................................... 51 Figure 31 FOCM models selected for application ............................................................ 53 Figure 32 Tools on the BioExtract server to process big models ..................................... 54 Figure 33 A total view of the web portal SW4SBMLm ................................................... 55

v

1. Introduction 1.1. Context and motivation The manipulation of XML based biological systems related representation (BioML for Bioscience Markup Language) is a big challenge in systems biology due to the large heterogeneous community involved. That challenge becomes grater with the large amount of data generated with next generation material related to biologist’s needs such as the translational study of biological systems (e.g. metabolite-gene interaction). Among these Biological Markup Languages or BioML’s (CellML [1], KGML [2], CySBML [3], GraphML [4], SBML [5]), SBML is the de facto standard file format for the storage and exchange of quantitative computational models in systems biology, supported by more than 257 software packages to date (April 2014) [6].The issue of combining biological sub-systems by merging SBML files has been addressed in several algorithms and tools [7], [8]. But it remains impossible to build an automatic merge system that implements reusability, flexibility, scalability and sharability. The SBML standard is used by several biological systems modeling tools (e.g., Matlab Systems Biology [9], Copasi [10], EPISIM [11], Virtual Cell [12]) and several databases for the representation and knowledge sharing (e.g., Biomodel [13], BRENDA [14], KEGG [15]). Most (80%) of those tools are open source [16]. SBML can be used to represent many different models (e.g., metabolic network model, neural model, spatially homogeneous/inhomogeneous model).

1

Big data and big problem for big system need flexibility There is a growing desire to integrate all biological systems (e.g., disease-gene, model in drug discovery, metabolism, gene regulation, interaction) at all levels (e.g., intracellular, cellular, tissue, organismal, environmental, epidemiological), across time (from year to nanosecond scale) and spatial (from nanometer scale to kilometer) scales. The integration is facing several challenges including the interchangeability of knowledge [17]. The merging activity in systems biology in silico approach is very useful, especially for scalability [18]. Then it could be easy to build new models from existing ones by using merging tools.

Systems Biology and Biological Systems challenges: system and data integration for model integration It is clear that systems integration is very important for biologists [19], [20]. This integration allows, for example, a molecular biologist to understand the influence of his or her work on other levels (e.g., proteomics, genomics, epidemiological) [21]–[23]. It is also clear that the integration of several biological levels makes it even more difficult to analyze biological systems that are already quite complex [24]. The question is this: how to use the power of integration without losing the efficiency of specialization for biologists? The development of workflows / pipelines has proved to be very effective. This approach aims to make available to biologists several tools (software) and resources (files, data) in the form of services. Thus, for a given biologist (molecular biologist), a given biological question (Does Frataxin in mice cause liver tumors? [25]), in a specific context, we can arrange the services (query, alignment, similarity, enzyme kinetics,

2

protein structure) relevant to the issue’s resolution by browsing in an abstract way across databases (e.g., sequence database, protein structure database, database biomodels [13], KEGG [15]) and the different biological levels (e.g., gene, metabolite, protein, mRNA). In addition to the multidimensional integration ability, this approach facilitates the sharing and reuse of the built workflows. This saves considerable time and guarantees reproducibility of analysis results. In systems biology, we have two main integration types: systems integration and data integration [26], [27]. Systems integration, of interest here, is based on the establishment of subsystems tool interconnection. These tools need standardization to facilitate collaboration and sharing in the community. Most of the biologists are interested in the SBML standard offered by W3C (World Wide Web Consortium) for the representation and exchange of biological knowledge at several levels (cell, metabolite, gene) with the help of graph theory (network) and biochemical reactions (pathway) [28]. The fast growth of biological databases (e.g., EBI [13], NCBI [29], GO [30], BRENDA [14], KEGG [15], PDB [31]) that allow for the SBML systems representation is causing several issues due to the variety of manipulation tools (cloud, big data, etc.), and their integration into the effective analysis of those biological systems [32], [33]. The use of these data in my work is of particular interest, the representation of a biological system with the SBML is closely related to the experimental biological reality available in these databases.

3

1.2. Existing work and limits Today several SBML merge algorithms and tools are available such as SBMLmerge [7] and SemanticSBML [34]. One of the most used is that proposed by EBI, the SBMLeditor. The strength of this editor resides on the ontology use and the Minimal Information Required in Annotation of Biochemical Models (MIRIAM) standard for knowledge representation in systems biology [35]. However, this editor is a standalone version and does not allow integration into the Workflow Management System (WMS) to build pipelines. This kind of SBML merging tool using the semantic web is provided on the ‘semanticSBML’ portal [34], but this tool does not provide APIs for the edition of SBML and it does not include the mapping of quantitative data (concentration, flows) needed for a good analysis of the biological system [36]. Furthermore, it does not merge automatically (without user interaction). This tool expects user suggestions to remove conflicts, otherwise it allows replications/conflicts. These existing tools mentioned above are non-deterministic, they require human interaction during model combining. This nondeterminism requires user to supply additional expertise to complete merging.

Another

important fact is, they consider mostly the name of the components to compare them. It is not a good way to match components, because a single term can have more than one different name.

4

1.3. Research goals and objectives How to merge biological models deterministically to integrate the merging process in a WMS? This work aims to propose a deterministic merging algorithm that is consumable in a given WMS engine such as the BioExtract server [37]. Our objective is driven by four goals: Goal 1: Develop biological model similarity algorithm Goal 2: Develop a deterministic merging algorithm for SBML model Goal 3: Implement merging algorithm, deploy it on iPlant and make it available on Bioextract Server. Goal 4: Create an online version of SBMLMerge.

5

1.4. Literature review a. The System Biology Markup Language (SBML) Generally, markup languages are used to represent the natural data in the form of machine understandable formal data. In this way, the well-structured markup language SBML was introduced on 2001 by sbml.org to represent biological models [5]. An SBML file represents a model that has to be used to specify the set of reactions containing in more than one compartment. A model object can be aggregated with lists of function definitions, unit definitions, compartments, species, parameters, initial assignments, rules, constraints, reactions, and events (level 3 version 1- Figure 1) [38]. However, this structure may vary from version to version. The SBML team released seven specifications with different level and version combinations. Some lists of components (the objects defined in SBML structure) and/or their properties are introduced, removed or modified by release to release (Table 1) [38]–[42]. This kind of difference should affect the tasks (e.g., comparison, merging) that deal with more than one model whenever they relate different level-version.

6

Table 1 The existence of the list of components in each SBML level-version [38]–[40], [42], [73]. Y and N state existence and not existence respectively. Level. Version List of Components 1

1.2

2.1

2.2

2.3

2.4

3.1

Function definitions

N

N

Y

Y

Y

Y

Y

Unit definitions

Y

Y

Y

Y

Y

Y

Y

Compartment types

N

N

N

Y

Y

Y

N

Species types

N

N

N

Y

Y

Y

N

Compartments

Y

Y

Y

Y

Y

Y

Y

Species

Y

Y

Y

Y

Y

Y

Y

Parameters

Y

Y

Y

Y

Y

Y

Y

Initial assignments

N

N

N

Y

Y

Y

Y

Rules

Y

Y

Y

Y

Y

Y

Y

Constraints

N

N

N

Y

Y

Y

Y

Reactions

Y

Y

Y

Y

Y

Y

Y

Events

N

N

Y

Y

Y

Y

Y

7

The SBML file is structured as a graph with nodes of components and each node in a particular graph should be the same level-version. Because of this level-version issue, any component cannot be inserted into a model if they differ on their level-version. Furthermore, components vary on their properties over level-version combinations, so we have to know the characteristics of components in each level-version before attempting to handle. Whenever there is an attempt to set unsupported properties for component, JSBML (Java library for SBML file manipulation) automatically goes to exception [43]. (e.g.: The property UseValuesFromTriggerTime of component Event cannot be set if the level-version

combination

is

lower

than

2.4).

It

will

go

to

PropertyNotAvailableException when you call setUseValuesFromTriggerTime (boolean useValuesFromTriggerTime). In order to insert a component, because of the level-version issue, we have to create a brand new component by the model and then all properties going to be set. The Biomodels database provides a facility to download a model in any level-version. However, they did not ensure the quality of models except the model with original level and version.

8

Figure 1 Relationship among the all possible components in SBML model [38]

9

How would a component be encoded with SBML? Consider a reaction in BIOMD0000000018 - Morrison1989_FolateCycle [44], ℎ

+

→ 5,10 −



In this reaction, there are two reactants (

and produce a product (5,10 −







− ℎ



,

) that react

). These three

metabolites (two reactants and a product) are species, they involve the reaction under the kinetic law,

×ℎ ×



with compartment cell, local parameter hp (of

23.2), and species FH4 (tetrahydrofolate) and HCHO. This reaction is encoded as follow in a SBML file: cell hp FH4 HCHO

10

b. Biological model comparison String Similarity Algorithm Generally, the similarity between two strings can be specified in terms of distance or proximity. The more similar strings have less distance between them, but oppositely the proximity between them is greater. Proximity can be converted into a distance or viceversa by either negation (additive inverse) or inversion (multiplicative inverse). Hamming [45] and edit distances [46] are mostly applicable measurements and both are used to determine biological similarity (e.g. The hamming distance is used in Biomolecular sequence indexing [47], inferring HIV transmission dynamics from phylogenetic sequence relationships [48] and the edit distance is used in biological sequence comparison [49], phonetic similarity checking [50]). The simplest version of edit distance was defined by Levenshtein that is also called Levenshtein distance [51]. This edit distance method allows three basic operations: a single character deletion, insertion, and the substitution of one character for another. This is a case sensitive approach, so it adds distance for case difference (e.g., edit distance between Bicarbonate and bicarbonate is 1). The edit distance between strings is only zero if the strings are identical. e.g.: The edit distance between “DHF”, “THF” can be calculated in two different ways: substitute D with T, so edit distance is one, or apply the sequence of two edit operations, delete D and insert T, so edit distance is two. We can consider the best chain of edit operations to find the minimum distance. The improved version of edit distance was modified by Damerau to allow transposition as an additional operation, but this also maintains same weights for each 11

edit operation [52]. In contrast, weighted edit distance uses different weights for operations, and also the weight depends on the actual characters being edited. The simplest version of weighted edit distance keeps constant values for each edit operation. To avoid special characters, the weights for these characters can be set to zero. The insertion weight can set to be equal to the deletion weight. The weighted edit distance approach is used in molecular biology; the Needleman-Wunsch algorithm was designed on the basis of weighted edit distance and used in bioinformatics to align protein or nucleotide sequences [53] . Another edit distance method was introduced by Jaccard that compares two strings by tokens (sub strings) [54]. This approach tokenizes the strings first and then finds the ratio between the number of shared tokens and the total number of tokens. Similarity algorithm for biomodel comparison The edit distance algorithm is preferred to use the model comparison because of its efficiency and easy implementation. The SBMLmerge developed by Marvin Schulz et. al. [7] compares metabolite names based on the Levenshtein ratio that could be calculated by the following formula: =(



! "



"

)/



! "

However, ratio is not enough to find similarity of small strings because ratios of small strings are low. Let length of big string be l and distance be d, The ratio, strings ( ≥

= %1 − & ' will be nearly one if l is very big. Therefore, for accept the big ) ),

threshold value for r,

)

should be high. This increased value may fail to

12

accept matching names. In contrast, if matching strings are very small,

)

should be low

and it may accept unmatched names. e.g., consider the names, “thf”, “Thf” both represent the metabolite tetra hydro folate. "

( ℎ , *ℎ ) = 1, and

( ℎ , *ℎ ) = 0.66. )

In order to accept these kinds of names

some unmatched names that have ratio ≥ 0.66.

should be ≤ 0.66. It also may bring

Similarly only taking distance should affect the big string comparison, it needs to increase the threshold of distance (Table 2). Table 2 Levenshtein ratio and edit distance variation for different names Name A

Name B

Distance

Ratio

Bicarbonate

bicarbonate

1

91.0

Glucose-6-phosphate

Glucose six phosphate

5

76.0

Glucose-6-phosphate

Glucose-six-phosphate

3

86.0

L-lysine

L-Lysine

1

88.0

D-glutamate

G-Glutamate

2

82.0

Serine

serine

6

83.0

We can also take the different metabolites “5 10 methylenetetrahydrofolat”, “5 methyltetrahydrofolate” as an example for big name comparison. Even though they are different names, the ratio between them is high (0.89). In order to reject these names, we have to increase the threshold ratio. It may also reject matching names. "

(5 10

(5 10









, 5

, 5

13









) = 6,

) = 0.89.

and

As this way of argument, we have to take both distance and ratio into account. However, any of the above described edit distance algorithms (simplest version or weighted version) cannot be applied directly to compare metabolite names. This is because metabolite names should be analyzed with including of several special facts such as sub-name comparison, number format management, abbreviation, and special character focusing etc. A new algorithm have been designed for this purpose. c. The semantic in biological modeling There are several organizations (EBI [13], NCBI [29]) maintaining databases ( Biomodels [13], Protein, Gene, etc.) and/or ontologies (gene ontology [30]) in order to manage biological components (e.g., reaction, species, etc.) in a standard way. They try to categorize the already defined components and identify relationships among them. Each database assigns unique id to each element and keeps relevant important details (e.g. properties, description) with them. Furthermore, we can find several web applications (e.g., KEGG Mapper) that provide services to map the same components from different places [15]. Some of them provide web services especially RESTful services that could be used by software tool developers (web services for GO terms and annotation provided by EBML-EBI) [55]. A single component can be annotated by multiple databases and or ontologies. The SBML defines annotation-tag to annotate biological components, it has resources with the details of database and id for each annotation [38]. e.g. The reaction MTHFR, [5,10-methylene-tetrahydrofolate] + [NADPH] → [5-methyl-tetrahydrofolate] in BIOMD0000000018 has the annotations "urn:miriam:ec-code:1.5.1.20", "urn:miriam:kegg.reaction:R01224". This reaction has Enzyme id 1.5.1.20 and KEGG reaction id is R01224. 14

We can choose these kinds of IDs to uniquely identify biological components specified in biomodels (SBMLCompare).

1.5. Methodology and algorithm My method resides in the integration of the semantics and the intuition to provide innovative deterministic toolkit for SBML merging operations.

Figure 2 SBML level 3 version1 component list and interactions with the compartment of SBML model The compartment object has strong relationship with many other components defined in SBML (Figure 2) [38]. This object inherits unit from model object and has the properties: id, name, spatial dimension, size, units, and constant. The property units relates to unit object, and the size can be assigned by initial assignment, rules (assignment rule, algebraic rule, rate rule), or event assignment externally. However, it is possible to assign only if the constant is false except initial assignment. Function definition and constraint can use compartment as a variable. The reaction and species maintain relationship with compartment by holding the id. 15

SBML models, computational models of biological process, are represented by combinations of fact data structures, usage principles, and serialization to XML. These models literally represent reactions which are related with a pool of species that are located in containers called compartments. Furthermore, for a complete picture, several individual components, objects with attributes, and lists of components are structured in the model. Those components have strong relationships among them (e.g. Figure 2, how a compartment relates with others). Any changes, therefore, may affect more than one component relatively. So, merging two individual components’ lists influences many others. We have to consider each and every individual component carefully to merge models. And we also have to ensure that those are not in conflict (i.e. there are no clashes with identification and/or actual component). We can introduce two main tasks, comparison and refactor merging to find matching terms and apply changes to all connected components respectively. For the comparison, it is not enough to match with names by using string matching algorithms because the same term may be named in totally

different

ways

(e.g.,

0

is

named

as

Formaldehyde

in

Reed2008_Glutathione_Metabolism and HCHO in Morrison1989_FolateCycle). The Same terms, however, in a specific database have a common index (e.g. C00067 in KEGG has several names, Formaldehyde, Methanal, etc.). Even though components in a database have unique identifiers, different components may be annotated with same identification (e.g., both “m_THF + m_Serine ↔ m_Glycine + m_5-10-methylene-THF” in

“Reed2008_Glutathione_Metabolism”,

and

“tetrahydrofolate + serine → 5,10-

methylene-tetrahydrofolate” in “Morrison1989_FolateCycle” are annotated with same enzyme commission number EC 2.1.2.1 by enzyme nomenclature).

16

Based on this fact, I planned to consider both name and URL identification for the comparison as syntax and semantic validation. Levenshtein edit distance algorithm is used to implement the metabolite name comparison, but it cannot be used to match different styles of metabolite naming (e.g. “2-Deoxy-D-arabino-hexose” and “D-arabino2-Deoxyhexose” have same KEGG ID “C00586”, but the edit distance yielded by this algorithm is heigh,15) [7]. Therefore, a new approach is designed on the basis of Leven’s edit distance algorithm. The model merging system is designed with integration of four sub modules, each of which plays a specific role to ensure a fruitful merging of SBML models (Figure 3). These sub modules are SBMLChecker, SBMLAnot, SBMLCompare, and SBMLMerge. 2

1

SBMLChecker

Is Reliable

Is Reliable

N

Improve reliability level

Y

N

Y

SBMLCompare N

SBMLAnot

Is Mergable

N

Y

SBMLMerge 12

SBMLChecker Is Reliable N

N Y

Reliable

12

Figure 3 Workflow diagram for SBML merging toolkit

17

Improve reliability level

SBMLAnot

SBMLChecker

Each model has to be ensured for its reliability. If it is not reliable, its quality has to be improved. The model reliability checking and quality enhancement is repeatedly processed, by the modules SBMLChecker, SBMLAnot respectively, until it meets the needed reliability level. In order to decide for merging, the two models have to be compared by using the SBMLCompare. Finally, they are merged together by the module SBMLMerge and the reliability checked. Again and again the models will be improved on their reliability and will be traversed through the same path until a valid merged model is obtained. a. SBML merge principles Semantic similarity is integrated into the model comparison method. SBML models are merged by comparing each component with an automated process using refractor synchronization management b. SBML merge implementation The IDE NetBeans (7.3) was used to fulfill everything related to coding, and the JSBML library (jsbml-0.8-with-depenedencies) was used to manipulate SBML file objects [56]. A web interface was designed by UI components and tags of PrimeFaces (3.5) and logically connected with SBML file processors by using JSF (2.1) managed beans. The J2EE 6.0 architecture and JDK 1.7 java library were used for this development [57][58]. The development was deployed on the server GlassFish Server 3.1.2 and tested frequently. Furthermore, the supporting libraries for file uploading: commons-fileupload (1.2.2), commons-io (1.4), the library to handle excel file: apache poi, and any other relevant libraries were included.

18

The JSBML is a flexible and entirely java-based library for working with SBML. This library supports all SBML Levels and Versions through Level 3 Version 1, and it has been strived to maintain the highest possible degree of compatibility with the popular library libSBML [56]. JSBML also supports modules that can facilitate the development of plugins for end user applications, as well as ease migration from a libSBMLbased backend. The interface was designed using PrimeFaces. PrimeFaces is a lightweight open source component suite for Java Server Faces 2.0 featuring more than one hundred JSF components [59]. This can be used by a single jar without any additional dependencies and configurations. They provide guides in several ways such as showcase with sample codes and documentation with clear explanation. It supports native Ajax Push/Comet and provides built-in Ajax based on standard JSF 2.0 Ajax APIs. Javadoc was chosen for documentation that is a standard tool for generating java API documentation [60]. It facilitates documentation in HTML format from java source code. This can generate documentation from Java source code easily and rapidly without any additional codes.

19

(b)

(a)

Figure 4 Architecture of the Bioextract server (a), SBMLCompare on the iPlant Cyber infrastructure (b)

The developed modules are integrated into the Bioextract Server (4.a) leveraging the iPlant collaborative resources (4.b), and are available for the iPlant and Bioextract Server users [37], [61].

20

2. The SBMLChecker for reliability checking Automatic merging has little value if the given models are not consistent. Since the system cannot identify the exact matching components, it cannot remove conflicts. Therefore, before the merging process, models have to be checked for their quality. The quality checking should ensure their correctness on syntax and strength on semantic. The Online SBML validator introduced by SBML.org provides the services to test syntax and internal consistency of an SBML model. This automatic system validates the following facts [62]: -

Check consistency of measurement units associated with quantities (SBML L2V4 rules 105nn)

-

Check correctness and consistency of identifiers used for model entities (SBML L2V4 rules 103nn)

-

Check syntax of MathML mathematical expressions (SBML L2V4 rules 102nn)

-

Check validity of SBO identifiers (if any) used in the model (SBML L2V4 rules 107nn)

-

Perform static analysis of whether the model is over determined

-

Perform additional checks for recommended good modeling practices

-

Perform all other general SBML consistency checks (SBML L2V4 rules 2nnnn; highly recommended)

However, this system does not consider the entire annotation information to evaluate the meaning of models. In order to analyze the semantics and syntax of models, a tool is designed that extends the web services provided by the online SBML validator.

21

2.1. Principle The reliability level of a model is calculated based on its validity of syntax and semantics. Correctness of models on syntax is examined with the usage of web services provided by the online SBML validator. Semantic strength is measured by the annotated URL id of each model’s component.

2.2. Design and algorithms for SBMLChecker SBMLChecker does two way analysis, one for semantic strength and another for syntax correctness (Figure 5.), and generates reports R1, R2 respectively.

Figure 5 Global mechanism for model quality checking

Figure 6 Algorithm for semantic analysis

22

The semantic analyzer takes each kind of component separately and identifies all ontologies and databases that are used to annotate it. For example, if any species is annotated with KEGG id, KEGG will be used to check the annotation of all other remaining species. Then the percentage of KEGG annotation will be calculated. Species are considered being more consistent/reliable if the percentage is high. In this way, percentage of every possible annotated ontologies/databases will be calculated. The maximum percentage will decide the consistency level. Consistency for the components ( databases, 3

=

where,

3)

5 (∀7 , 0 ≤ ≤ 7

of kind k over

43

ontologies and/or

( 7 ))

43 ,

is an ith ontology or database

Finally, cumulative consistency is calculated by taking the average consistencies of each kind of component. Consistency of model m, 8

=

∑∀3

3 3

The number of components,

= ∑∀3

3

In addition to the consistency report, an error report is generated by combining the online SBML validator’s error report with my own semantic check error report. Based on the quality checking, the user will be required to provide a valid model, but it can be skipped if they want.

23

2.3. Implementation

Figure 7 SBMLChecker on BioExtract server for the reliability checking of biomodels

Figure 8 SBMLChecker on the web portal

A Java program named SBMLChecker.jar is designed for reliability checking. This can process SBML files received through the command line parameter argument, and writes the generated report in excel, text, and xml formats. The SBMLChecker was deployed on the HPC (High Performance Computing) infrastructure iplant for availability on BioExtract server (Figure 7), and was integrated on the web portal SW4SBMLm (Figure 8).

24

2.4. Validation The model BIOMD0000000018 is examined for reliability by SBMLChecker. According to the results, it is syntactically valid and earned a semantic score of 79%. This semantic score comes from: compartment 100%, species 93%, and reaction 44%. The model’s reliability can be improved by annotating it. Another system SBMLAnot have been introduced to support users who want to improve the consistency level of the model. The next section describes the annotation assistant tool.

2.5. SBMLAnot for model annotation The merging approach uses the information from annotations to make decisions about the components’ similarities. Well annotated models can be merged automatically without losing any data. In order to ensure accurate merging, a user has to supply well annotated models a. Principle Each component is checked and appropriate ontologies and/or databases are found. An equivalent term is selected from a list of possible suitable terms using the provided RESTful web services, and it will be used to annotate. b. Design and algorithm for SBMLAnot In the case of failure to give a valid model, SBMLAnot will support converting it to a valid model. This works in two ways: quick annotation, the automatic annotation of the whole model, and advanced annotation, the annotation of each component by user interaction. For the advanced annotation, the user is required to select appropriate values 25

from a list of possible terms. The user can annotate components one by one until they decide the consistency level is enough for merging.

Figure 9 Global mechanism for model annotation

Both ways find an appropriate term for an component from a suitable ontology/database, and annotate it using the provided web services (Figure 9). This process is repeatedly executed for annotating each component, but using advanced annotation, the user can skip annotation of any components. The appropriate algorithm for the automatic quick annotation is shown in the Figure 10.

Figure 10 Algorithm for model annotation

26

c. Implementation The module SBMLAnot is integrated into the web portal SW4SBMLm. SBMLAnot can display lists of components (compartments, species, and reactions) with annotated databases and/or ontologies. A user can annotate each component one by one using the provided interface (Figure 11).

Figure 11 Annotate assistant on the web portal SW4SBMLm

d. Validation The biomodel BIOMD0000000018 is loaded into the annotate assistant and annotated component by component. For example, the term cell is annotated with “GO%3A0005623” for the compartment cell of this model (Figure 11).

27

3. Algorithm for BioML Similarity (ABioS) Biological models in BioML format (e.g. SBML [5], CellML [1]) deal with species (compartment, metabolite, gene, enzyme, transcript, etc...) named in a variety of ways by different biologists or communities. The key problem remains how to find similar species within the models. Several algorithms have been designed to find similarity between two strings using different measures. The algorithms applied to biological species show weak similarity results compared to visual similarity checking [7], [63]. This is because those algorithms do not consider some important facts in biological species naming. A single metabolite can be identified by different names that vary slightly in their string patterns. In addition, the integration of special symbols or digits significantly reduce the consistency of string manipulation algorithms because of homologous naming, subnames chaining, sub-names pattern permutation, and sub-names grouping or splitting using special symbols. Here, a novel, intuition-based approach algorithm is introduced in species similarity evaluation processing. This algorithm integrates the analysis of subnames, and a symbol management strategy that conserve the species name specifications.

3.1. Principle Analyze the strings individually by these techniques: split the whole name into several sub names by the special characters (hyphen, underscore, space, comma, and brackets), convert the digits to words, ignore the case sensitivity, and permute and match sub-names. Use the following heuristic techniques: •

Set distance with length of the big string if the strings only vary by their prefix character that represents isomers.

28



Match the short form with these: first letters of sub names, letters in considerable gap, and capital letters in a name.

3.2. ABioS design Strings :; and :? are split using special character markers, and numerical numbers used in the name are worded (‘6‘ worded to ‘six’). The distance is calculated using Leven distance for all possible permutations, and the minimum distance is taken to be d [51],

[63]. Finally, the decision will be made based on the parameters , , @ , and ) . A higher distance indicates a lower similarity quality.

Figure 12 Action Flow diagram for ABioS

29

Procedure for finding distance between two strings

Pre-condition: The given strings, :; A B , :? A B and :; A :? :

CDE

:; , :?

"" !

F

Post-condition: distance (≥ 0)

",

F

GH

Making the decision, whether strings are similar or not ":

(:; , :? ) =

Here, the distance,

L

K J

F , ( " 1MH (





(:; , :? ) ≤ (:; , :? ) ≥

F I F

@, )

), )

" , /N

= min (∀7,Q GH(R7 (:; ), R7 (:? ))

GH(:; , :? ) = Leven Edit Distance between the strings :; , :? = maxT

Ratio,

@ , ) :

ℎ(:; ),

= ( − )⁄

ℎ(:? )U

threshold value for distance and ratio respectively

Figure 13 Algorithm for finding distance between two names

30

The similarity checking between two names consumes the time (W!) where

W = max %

ℎ(:

; ),

ℎ(:

? )'.

This is the looping time, it loops for the

number of possible permutations of a name string that has the most sub-names. The memory space taken by this algorithm depends on size of the names, so space complexity can be expressed as %max %

ℎ(:; ),

ℎ(:? )''.

The similarity checking for species in two biomodels depends on the amount of species they have. The time complexity is respectively. Similarly it takes (max(

TmaxT , UU if models have m, n species

, )) memory space on average.

ABioS key points •

Special characters: The special characters (hyphen, underscore, space, comma, and brackets) are used to split the whole name into several sub names. Glucose-6phosphate can be split into three sub names Glucose, 6, and phosphate by using the character hyphen (‘-’)



Number representation - Word versus digit: The metabolite 2-Phospho-Dglycerate is 2-Phosphoglyceric acid, or 2-phosphoglycerate and is a glyceric acid which serves as the substrate in the ninth step of glycolysis. The metabolite 3Phospho-D-glycerate is 3-Phosphoglyceric acid, or glycerate 3-phosphate and is a biochemically significant 3-carbon molecule which is a metabolic intermediate in both glycolysis and the Calvin cycle. Based on edit distance algorithm, those different chemicals have a distance of only one (distance between “2”, ”3”), but they can be significantly increased in distance to four by converting digits to words (distance between “two”, “three” is four) 31



Prefix name: Join a sub name if it is a prefix name of following name. In Glucose-6-phosphate, the sub names six and phosphate will be joined together as sixphosphate because six is a prefix name of phosphate. Prefix names represent numeric values, number of occurrences (e.g. mono, di, tri etc.), or isomers (D, L, S, R, α, and β).



Sub-name permutation: Consider the same species names: 2-Deoxy-D-arabinohexose and D-arabino-2-Deoxyhexose [7]. The 2-Deoxy-D-arabino-hexose can be split three sub names 2Deoxy, Darabino, and hexose. And the D-arabino-2Deoxyhexose also can have two sub names: Darabino and 2Deoxyhexose. By permuting the first three sub names, we can get the same name as the second. (Darabino, 2Deoxy, hexose)



Ignore case: Exactly equal names of the same components may not be matched due to case sensitivity. This issue can be solved by ignoring case sensitivity (There is no distance between Bicarbonate, bicarbonate, even though the first letters are not in same case)



Heuristics Techniques: •

A single letter variation – set distance with length of the big string if the strings only vary by their prefix character that represents isomers.



Short form - name in short form could be matched with : string of first letters of sub names (Tetra hydro folate), letters in considerable gap (Tetrahydrofolate), capital letters in a name (TetraHydroFolate)

32

3.3. Application of ABios on FOCM The different style of names is selected based on the lacks of the two existing algorithms [7], [63].

T T T T T T F F F T T T T T T T

Bicarbonate Glucose-6-phosphate Glucose-6-phosphate L-tryptophanyl-tRNAtrp L-lysine D-glutamate abcd Singh2006_TCA_Ecoli_acetate 2-Phospho-D-glycerate 2-Deoxy-D-arabino-hexose 5,10-methylene-tetra5,10-methylenec_DHF α-D-Glucose N-acetylglucosamine D-Glucose,2-(acetylamino)-2-

bicarbonate Glucose six phosphate Glucose-six-phosphate L-Tryptophanyl-tRNA(trp) L-Lysine G-Glutamate aZcZ Singh2006_TCA_Ecoli_glucose 3-Phospho-D-glycerate D-arabino-2-Deoxyhexose 5,10-methylene-tetrahydrofolate 5,10-methylene-THF m_DHF β-D-Glucose NAG 2-Acetylamino-2-deoxy-D-

deoxy-

glucose

Algo_2

ABioS

Similar

YZ

Algo_1

Table 3 Distance and ratio between species using ABioS, Algo_1 [7] and Algo_2 [63]

d 0 0 0 0 0 1 2 6 4 0 0 0 0 11 0

r 1.00 1.00 1.00 1.00 1.00 0.91 0.50 0.88 0.86 1.00 1.00 1.00 1.00 0.00 1.00

d 1 5 3 3 1 2 2 6 1 15 1 16 1 1 18

d 1 5 3 3 1 2 2 6 1 15 1 16 1 1 18

r 0.91 0.76 0.86 0.88 0.88 0.82 0.50 0.78 0.95 0.38 0.97 0.48 0.80 0.91 0.05

0

1.00

23

23 0.34

Figure 14 Distance comparison between ABioS (orange) and existing algorithms (grey) on each biological species (Table 3). 33

Figure 14. presents the distances calculated using ABioS and existing algorithms (Table 3). The ABioS shows lower distances compared to other algorithms. They hit 100% for the range of 0:1 with no wrong hits. In contrast, existing algorithms only hit 46% for the same range (0:1), but they earn 100% hits for range 0:23 with 33% wrong hits. By taking both distance and ratio, ABioS achieves 100% hits with no wrong hits

for ≤ 1

> 0.80. The existing algorithm Algo_1 hits 69% with 19% wrong hits

for meaningful values with ≤ 23

> 0.50

Description of Terms Hits: Number of correct similarities identified Misses: Number of similarities that were not identified Hit ratio: The ratio between Hits and number of visual similarities Miss ratio: The ratio between Misses and number of visual similarities Wrong hits: Number of similarities that were identified wrongly Wrong hit ratio: The ratio between Wrong hits and Hits Penalty: Sum of the Miss ratio and Wrong hit ratio

Three FOCM models that are used in my lab: BIOMD0000000018, BIOMD0000000213, and BIOMD0000000268 are chose to study their similarity.

34

Table 4 Comparison of similarity measurements between ABioS and Edit_dist on selected models B_18, B_268

B_213, B_268

ABioS

Edit_dist

ABios

Hits

12

8

10

8

10

8

Misses

2

6

3

5

4

6

Hit ratio

0.8571

0.5714

0.7692

0.6154

0.7143

0.5714

Miss ratio

0.1429

0.4286

0.2308

0.3846

0.2857

0.4286

1

4

2

2

3

2

Wrong_hit ratio

0.0769

0.3333

0.1667

0.2000

0.2308

0.2000

Penalty

0.2198

0.7619

0.3974

0.5846

0.5165

0.6286

Wrong_hits

(a)

Ratio

B_18, B_213

Edit_dist ABioS Edit_dist

Hit Ratio 1.0000 0.9000 0.8000 0.7000 0.6000 0.5000 0.4000 0.3000 0.2000 0.1000 0.0000

0.8571 0.7692 0.5714

0.7143 0.6154

0.5714 ABios Edit_Dist

B_18, B_268

B_18, B_213

B_213, B_268

Sample pair of biomodels

35

Ratio

(b)

Penalty 0.9000 0.8000 0.7000 0.6000 0.5000 0.4000 0.3000 0.2000 0.1000 0.0000

0.7619 0.6286

0.5846 0.5165 0.3974

ABios 0.2198

Edit_Dist

B_18, B_268

B_18, B_213

B_213, B_268

Sample pair of biomodels

Figure 15 Comparison of ABioS, and existing algorithm on selected models using Hit Ratio (a) and Penalty (b) parameters

Average value of hit ratio and penalty for the ratio of misses and wrong hits have been calculated using ABioS and existing algorithm (Table 4). Figure 15 shows that ABioS is better on each pair of models on the similarity parameters hit ratio and penalty. ABios achieves 78% accuracy on similarity findings and improves 24% on average on the three FOCM models from Biomodels database. Accuracy of this algorithm improves existing algorithm by over 20% with time complexity better than and

(

5( ,

)) for

biological species models.

3.4. ABioS algorithm validation on Yeast models In order to validate the ABioS algorithm on big models, we chose the yeast models iND750 [64] and iFF708 [65] that have more than 1000 metabolites. A list of synonyms can be retrieved from BioCyc database [66] for such models. However, it is not possible to get synonyms for all kinds of models, so the string comparison algorithm ABioS is useful for general model comparison. 36

The two SBML yeast

models are downloaded from the BIGG database [67], and tested with ABioS following these steps: 1. Download SBML yeast models (Supplement: IND750.xml and iFF708.xml) 2. Extract a list of metabolites (Appendix A, Code1) iFF708 has 1177 metabolites and iND750 has 1523 metabolites 3. Extract a list of synonyms for each metabolite using RESTful API (Appendix A, Code2, Synonyms.xslx) •

Retrieve KEGG (http://rest.kegg.jp//find/compound/metabolite) ID for each metabolite. [68]



Retrieve MetaCyc ID (http://websvc.biocyc.org/META/foreignid?ids= KEGGID) in BioCyc portal [66] for each metabolite using KEGG ID



Retrieve XML output file for a metabolite using BioCyc ID



Extract synonyms from the XML file (4875 and 5520 synonyms for IND750 and iFF708)

4. Compare two lists of metabolites corresponding to the models (Appendix A Code3, Compare.txt, compare.xslx) Two metabolites are determined to be truly similar if and only if any pair of synonyms matches 5. Run the string similarity algorithm (ABioS) on the list of elements of the models •

Only string similarity (compare.xslx)



Compare the true similarity with the output of ABioS and existing algorithm Algo_1.

37

The result is compared by the venn-diagram, the hit ratio and penalty are calculated and graphed as shown below:

Figure 16 Venn diagram that compares ABioS, synonym, and existing algorithm on the models iND750, iFF708 The ABioS hits 254 similar metabolites over 666 metabolite pairs that have been identified by synonyms. (b)

Hit ratio

(a) 0.4500

0.3814

0.4000 0.3500

0.8000 0.7000

0.6578

0.6000

0.3000

0.5000

0.2500

0.3947

0.4000

0.2000

0.3000

0.1500 0.1000

Penalty

0.2000

0.0511

0.1000

0.0500 0.0000

0.0000

Algo_1

ABioS

Algo_1

ABioS

Figure 17 Comparison of ABioS, and existing algorithm on the models iND750, iFF708 using Hit Ratio (a) and Penalty (b) parameters

The algorithm ABioS hits (0.38%) more than existing algorithm (0.05%) with less penalty (0.39% < 0.66%). It clearly improves the result on metabolites comparison using their names.

38

4. The SBMLCompare for model comparison The growth of bio-systems model development encourages automatic approaches for model similarity evaluation that also supports systems biologists. Several algorithms have been proposed (e.g. semanticSBML [69], SBMLmerge [7]), but they lack in efficiency. I have developed an efficient, intuition-based approach algorithm, ABioS (Algorithm for BioML Similarity evaluation). The fundamental principle of my ABioS algorithm is the analysis of individual portions of model component names as strings in order to find accurate similarities. I developed SMBLcompare, an ABioS implementation for automatic bio-systems model comparison in SBML format. This implementation has been integrated into the Bioextract Server (bioextract.org) and aims to be consumed within workflows designed to address e-science challenges. A component can be identified by more than one id in the same database. For example,

in

BIOMD0000000018,

the

reaction

MTHFD

[5,10-methylene-

tetrahydrofolate] + [NADP] → [10-formyl-tetrahydrofolate] has two KEGG ids R01655, R01220 and EC numbers 1.5.1.5 and 6.3.4.3 but R01655 is mapped with 3.5.4.9 and R01120 with 1.5.1.5. At the same time, the reaction V_MTHFR, [c_5-10-methyleneTHF] + [NADPH] ↔ [c_5-methyl-THF] on BIOMD0000000268 was annotated with KEGG id R01220, but the reactions MTHFD and V_MTHFR are not the same. However, the reaction MTFHR, [c_5-10-methylene-THF] + [NADPH] ↔ [c_5methyl-THF] on BIOMD0000000018 is same as V_MTFHR but it has different ids R01224, 1.5.1.20 of KEGG and E.C number respectively. The reactions are not matched even though they have the same EC or KEGG ID.

39

Furthermore,

SHMT,

SHMTr

in

BIOMD0000000018

and

VmSHMT

in

BIOMD0000000268 have same KEGG ID R00945, but they are actually different reactions

[tetrahydrofolate] + [serine] → [5,10-methylene-tetrahydrofolate],

methylene-tetrahydrofolate] → [tetrahydrofolate],

[5,10and

[m_THF] + [m_Serine] ↔ [m_Glycine] + [m_5-10-methylene-THF] respectively. As the result of this analysis, we cannot use only the name and/or the URL id to compare reactions. In contrast, we have to consider each species that relates the reaction.

4.1. Principle The models are compared with component’s name and ontology ID for their similarity evaluation. In order to compare reactions, all relevant species have to be matched (100% similarity score).

4.2. Design and algorithms for SBMLCompare In order to find similarity between two models, each list components of a model is compared with specific lists of another model. Each appropriate pair is checked for their name similarity and ontology ID matching (Figure 18). Finally, the comparison result is reported.

Figure 18 Algorithm for comparing SBML models

40

HPC (iPlant)

SBMLCompare a jar file

WMS(BioExtract server)

Web portal (SW4SBMLm)

Figure 19 SBMLCompare as a module in a Workflow Management System and a web portal

SBMLCompare is a java application, executable as a jar from the command line. This module could be embedded into any software application. This is used in a web portal to manipulate SBML files, and deployed on the HPC infrastructure iPlant to study similarity of very big models. Through the iPlant collaborative cyber infrastructure, it can be used in design of workflows in the BioExtract Server [37].

4.3. Implementation

Figure 20 SBMLCompare on BioExtract server

41

Figure 21 SBMLCompare on web portal SW4SBMLm

For the studies of big model similarity, the tool is integrated into the BioExtract server (bioextract.org) [37] which leverages iPlant collaborative resources. This allows the tool to be utilized by WMS (Figure 20). The tool is also embedded with the web portal SW4SBMLm (Figure 21) and exists to compare small models by any users.

4.4. Application on yeast (e-coli) mathematical models comparison

(a)

42

(b)

Figure 22 Yeast selected models iND750 (a) iFF708 (b) for model similarity study.

Yeast is used in commercial and pharmaceutical production, it is a good model for eukaryotic organisms and has been written about in over 40,000 publications. Many genome-scale yeast metabolic models are available in literature or databases. These models are big and fully compartmentalized; iND750, for example, consists of 1498 reactions for 750 ORFs [64]. These models include genomics, phenomics, and metabolomics data in their design process. In order to address a good systems biology question related to yeast, researchers need to understand the similarity between these models. SBMLCompare has been successfully used on four genome-scale yeast metabolic models iLL672, iND750, iJR904, and iFF708. The similarity result showed a significant improvement compared to existing related work (over 10%).

43

Table 5 Number of compartment, species, and reactions in selected models Models

Measurement Approach iND750,iJR904

iND750,iFF708

iJR904,iFF709

ABios

0.9916

0.9934

0.9934

Edit_Dist

0.9451

0.9735

0.9571

ABios

0.0101

0.008

0.0086

Edit_Dist

0.0549

0.0285

0.0463

Hit ratio

Penalty

(a)

Hit Ratio 1.1 1.05

Ratio

1 0.95

0.9934 0.9735

0.9916 0.9451

0.9934 0.9571 ABios

0.9 Edit_Dist

0.85 0.8 iND750,iJR904 iND750,iFF708 iJR904,iFF709 Sample pair of biomodels

(b)

Penalty 0.07

Ratio

0.06 0.05

0.0549 0.0463

0.04 0.03 0.02 0.01 0

0.0285 ABios 0.0101

0.0086

0.008

Edit_Dist

iND750,iJR904 iND750,iFF708 iJR904,iFF709 Sample pair of biomodels

Figure 23 Reliability of SBMLCompare, and existing tool on selected models using Hit Ratio (a) and Penalty (b) parameters.

44

Figure 23 presents the average hit ratio and penalty of SBMLCompare and the existing tool. It shows that SBMLCompare is better on each pair of models, on the similarity parameters hit ratio and penalty. SBMLCompare achieves 99% accuracy on reliability and improves 10% averagely on average on these three yeast models.

45

5. SBMLMerge for the model merging 5.1. Principles The SBML models are merged by comparing each component. Conflict management, models synchronization and a refactoring system convert the models to be mergable. Integration of mergable models requires removing duplicates.

5.2. Design and algorithms for SBMLMerge

Figure 24 Global mechanism for SBML model merging design.

Figure 24 presents the global mechanism used for the merging process. The objective is to have the most modular algorithm possible that can be supported in API and command line execution. The algorithm needs to be flexible enough to involve ontology annotation (meaning). The model consistency is required to provide a valid output merged model. I have the two main procedures (comparison, merge) and two internal procedures for comparison that are used to match syntax and semantic (string, URL ID). My strategy consists of creating a new model and filling it with components from input models through comparison driven conflict management. The process consists of 46

four main steps: model initialization, component synchronization (conflict management), components list merging (with refactor management), and output model validation using SBML validator API. One of the most innovative points in this approach is the conflict management during component synchronization. In that process, trivial redundant components are just removed, but identified matched components face several issues: like conflict (name, Id, meaning), modification, selection, and external influence (e.g. component value from a different scope). These issues will be managed as follows:

a) Both As are semantically

b)

equal

semantically equal

Both

A

and

X

are

c) As are not semantically equal

Figure 25 Conflict management in merging SBML models

Identification within an SBML model can map in three different cases: two components can be merged together, if both are semantically equal (a, b), we can assign a new id for one of them, if two ids are matched when both are not semantically equal (c).

47

∶ ∶ I a Ia ∶ I ad ∶ " a7 ∶ = F a

Start

; ,

c " " !

c

c c f " "

?

; >

;? = ? I;? = I?

F

?

T ;? = ; I;? = I; ;?

=

g

( ;? , I;? )

=0