from different perspectives; a workflow-based solution to the integration of analytical tools ... domain experts and knowledge engineers was analyzed during the ...... Garcia Castro A, Bushel P et al: Chemical effects in biological systems - data ...... manual, and we acknowledge that some information may be lost, or even ...
H F-XC A N GE
H F-XC A N GE
c u-tr a c k
W N O y bu to k w
DEVELOPING ONTOLOGIES IN THE BIOLOGICAL DOMAIN
A thesis submitted for the degree of Doctor of Philosophy at The University of Queensland, Institute for Molecular Bioscience Alexander García Castro September 2007
.d o
o
.c
m
C lic
m o
.d o
w
w
w
w
w
C lic
k
to
bu
y
N O
W
!
PD
!
PD
c u-tr a c k
.c
H F-XC A N GE
H F-XC A N GE
c u-tr a c k
W N O y bu to k w
STATEMENT OF ORIGINALITY I declare that the work presented in this thesis is, to the best of my knowledge and belief, original and my own work, except as acknowledged in the text. The material (presented as my own) has not been submitted previously, either in whole or in part, for a degree at this or any other institution.
Alexander García Castro
STATEMENT OF CONTRIBUTION OF OTHERS In those cases in which the work presented in this thesis was the product of collaborative efforts I declare that my contribution was substantial and prominent, involving the development of original ideas as well as the definition and implementation of subsequent work. Detailed information about the participation of other researchers in parts of this thesis is provided in the section "Author's contributions" at the beginning of each chapter.
Mark A. Ragan
Alexander García Castro
.d o
o
.c
m
C lic
m o
.d o
w
w
w
w
w
C lic
k
to
bu
y
N O
W
!
PD
!
PD
c u-tr a c k
.c
H F-XC A N GE
H F-XC A N GE
c u-tr a c k
W N O y bu to k w
UNIVERSITY OF QUEENSLAND ABSTRACT Developing Ontologies In The Biological Domain by Alexander Garcia Castro Chairperson of Supervisory Committee: Professor: Mark Ragan The development of "omic" technologies and its applications into biological sciences has increased the need for an integrated view of bio-related information. The flood of information as well as the technological availability has made it necessary for researchers to share resources and join efforts more than ever in order to understand the function of genes, proteins and biological systems in general. Integrating biological information has been addressed mainly from a syntactical perspective. However, as we enter into the post-genomic era integration has acquired a meaning more related to the capacity of inference (finding hidden information) and sharebility in large web-based information systems. Ontologies play a central role when addressing both syntactic and semantic aspects of information integration. The purpose of this research has been to investigate how the biological community could develop those highly needed ontologies in a way that ensures both, maintainability and usability. Although the need for ontologies, as well as the benefits of having them, is obvious; it has proven to be difficult for the biological community not only to develop but also to effectively use them. Why? How should they be developed in such a way that they are maintainable and usable by existing and novel information systems? A feasible methodology elucidated from the careful study whilst developing biological ontologies is proposed. Methodological extensions gathered from the acquired experience are also presented. Throughout the chapters of this thesis diverse integrative approaches have also been analysed from different perspectives; a workflow-based solution to the integration of analytical tools was consequently proposed. This made it possible to better understand the need for welldefined semantics in biological information systems as well as the importance of a thoughtful
.d o
o
.c
m
C lic
m o
.d o
w
w
w
w
w
C lic
k
to
bu
y
N O
W
!
PD
!
PD
c u-tr a c k
.c
H F-XC A N GE
H F-XC A N GE
c u-tr a c k
W N O y bu to k w
understanding of the relationship between the semantic structure and the syntactic scaffold that should ultimately host the former. The role of communities in the construction of biological ontologies as well as the argumentative structure that takes place during the development and maintenance of them have been extensively studied in this thesis. What is the role of the domain expert when developing ontologies within the biological domain? Different scenarios in which ontologies were developed have been studied in order to answer this question. The relationship between domain experts and knowledge engineers was analyzed during the development of loosely centralised ontologies. As a consequence of those direct experiences developing ontologies a viable use for concept maps supporting collaboration and annotation was anticipated; consequent software developments are also part of this investigation. From this investigation several conclusions have been drawn, one of them with a particular significance is the relevance of collaboration between two asymmetric, yet not antagonist, communities; computer scientists and biologists may work and achieve results in different ways, nevertheless both communities hold valuable information that could be of mutual benefit. Within the context of biological ontologies "Romeo and Juliet" proved to be an apt metaphor that illustrates not only the importance of the collaboration, but also how we may avoid heading towards "A hundred years of solitude".
.d o
o
.c
m
C lic
m o
.d o
w
w
w
w
w
C lic
k
to
bu
y
N O
W
!
PD
!
PD
c u-tr a c k
.c
H F-XC A N GE
H F-XC A N GE
c u-tr a c k
W N O y bu to k w
TABLE OF CONTENTS
TABLE OF CONTENTS .............................................................................................................I LIST OF FIGURES ................................................................................................................ VII LIST OF TABLES ..................................................................................................................... X ACKNOWLEDGMENTS......................................................................................................... XI INTRODUCTION ................................................................................................................. XIII OVERVIEW ............................................................................................................................ XIII WHAT IS AN ONTOLOGY?........................................................................................................ XIV CONTROLLED VOCABULARIES AND ONTOLOGIES ...................................................................... XVI WHY ONTOLOGIES?............................................................................................................... XVII WHY COMMUNITIES? ........................................................................................................... XVIII BRINGING IT ALL TOGETHER .................................................................................................... XX RESEARCH PROBLEM ............................................................................................................. XXII CONTRIBUTIONS OF THIS THESIS............................................................................................ XXIII OUTLINE OF THIS THESIS.......................................................................................................XXIV PUBLISHED PAPERS ..............................................................................................................XXVI SOFTWARE DEVELOPED, INCLUDING ONTOLOGIES. ................................................................XXVII REFERENCES ......................................................................................................................XXVII 1
CHAPTER I - COMMUNITIES AT THE MELTING POINT WHEN BUILDING ONTOLOGIES.................................................................................................................. 31 1.1
INTRODUCTION ........................................................................................................... 31
1.2
METHODS AND METHODOLOGIES FOR BUILDING ONTOLOGIES ........................................ 33
1.2.1
The Enterprise Methodology................................................................................... 36
1.2.2
The TOVE Methodology ......................................................................................... 38
1.2.3
The Bernaras methodology ..................................................................................... 40
1.2.4
The METHONTOLOGY methodology ..................................................................... 41
1.2.5
The SENSUS methodology...................................................................................... 43
1.2.6
DILIGENT ............................................................................................................ 44
1.3 1.3.1
WHERE IS THE MELTING POINT? ................................................................................... 45 Similarities between methodologies......................................................................... 46
i
.d o
o
.c
m
C lic
m o
.d o
w
w
w
w
w
C lic
k
to
bu
y
N O
W
!
PD
!
PD
c u-tr a c k
.c
H F-XC A N GE
H F-XC A N GE
c u-tr a c k
W N O y bu to k w
1.3.2
2
Shortcoming of the methodologies........................................................................... 47
1.4
ACKNOWLEDGEMENTS ................................................................................................ 51
1.5
REFERENCES .............................................................................................................. 51
CHAPTER II - THE MELTING POINT, A METHODOLOGY FOR DEVELOPING ONTOLOGIES WITHIN DECENTRALISED SETTINGS ............................................. 56 2.1
INTRODUCTION ........................................................................................................... 56
2.2
TERMINOLOGICAL CONSIDERATIONS ............................................................................ 58
2.3
THE METHODOLOGY AND THE LIFE CYCLE .................................................................... 59
2.3.1
Documentation processes ....................................................................................... 60
2.3.1.1
Activities for documenting the management processes .................................................... 60
2.3.1.2
Documenting classes and properties ............................................................................... 61
2.3.2
Management processes .......................................................................................... 62
2.3.2.1
Scheduling ................................................................................................................... 62
2.3.2.2
Control ........................................................................................................................ 62
2.3.2.3
Inbound-interaction ...................................................................................................... 62
2.3.2.4
Outbound-interaction .................................................................................................... 63
2.3.2.5
Quality assurance ......................................................................................................... 63
2.3.3
Development-oriented processes............................................................................. 63
2.3.3.1
Feasibility study and milestones .................................................................................... 63
2.3.3.2
Activities for the conceptualisation ................................................................................ 63
2.3.3.2.1.1 2.3.3.3
Milestones, techniques and tasks for the ka and da activities............................. 66
Iterative Building of Ontology Models (IBOM). ............................................................. 66
2.3.3.3.1
Methods, Techniques and Milestones for the IBOM. ............................................... 66
2.3.3.4
Formalisation ............................................................................................................... 67
2.3.3.5
Evaluation.................................................................................................................... 67
2.3.3.5.1
Application-dependent evaluation .......................................................................... 68
2.3.3.5.2
Terminology evaluation. ....................................................................................... 68
2.3.3.5.3
Taxonomy evaluation............................................................................................ 68
2.3.3.6
A summary of the process. ............................................................................................ 69
2.4
AN INCREMENTAL EVOLUTIONARY SPIRAL MODEL OF TASKS, ACTIVITIES AND PROCESSES 69
2.5
DISCUSSION................................................................................................................ 72
2.6
CONCLUSIONS........................................................................................................ 75
2.7
ACKNOWLEDGEMENTS ................................................................................................ 76
2.8
REFERENCES .............................................................................................................. 76
ii
.d o
o
.c
m
C lic
m o
.d o
w
w
w
w
w
C lic
k
to
bu
y
N O
W
!
PD
!
PD
c u-tr a c k
.c
H F-XC A N GE
H F-XC A N GE
c u-tr a c k
W N O y bu to k w
3
CHAPTER III - THE USE OF CONCEPT MAPS DURING KNOWLEDGE ELICITATION IN ONTOLOGY DEVELOPMENT PROCESSES................................. 82 3.1
THE USE OF CONCEPT MAPS DURING KNOWLEDGE ELICITATION IN ONTOLOGY DEVELOPMENT PROCESSES – THE NUTRIGENOMICS USE CASE .......................................... 82
3.1.1
Background ........................................................................................................... 82
3.1.1.1
3.1.2
A survey of methodologies............................................................................................ 84
Methods ................................................................................................................ 88
3.1.2.1
General view of our methodology .................................................................................. 88
3.1.2.2
Scenarios and ontology development process ................................................................. 92
3.1.2.2.1
Identification of purpose, scope, competency questions and scenarios....................... 93
3.1.2.2.2
Identification of reusable and recyclable ontologies ................................................. 94
3.1.2.2.3
Domain analysis and knowledge acquisition ........................................................... 94
3.1.2.2.3.1
Attributes of the domain experts..................................................................... 95
3.1.2.2.3.2
The knowledge elicitation sessions ................................................................. 95
3.1.2.2.3.3
Representing conceptual queries .................................................................... 96
3.1.2.2.4
3.1.3
Iterative building of informal ontology models........................................................ 97
Future work........................................................................................................... 99
3.1.3.1
Formalisation ............................................................................................................... 99
3.1.3.2
Evaluation.................................................................................................................. 101
3.1.4
Discussion ........................................................................................................... 102
3.1.5
Conclusions ......................................................................................................... 104
3.1.6
Acknowledgements............................................................................................... 105
3.1.7
References ........................................................................................................... 106
3.2
THE USE OF CONCEPT MAPS FOR TWO ONTOLOGY DEVELOPMENTS: NUTRIGENOMICS, AND A MANAGEMENT SYSTEM FOR GENEALOGIES. .................................................................
4
110
3.2.1
Introduction......................................................................................................... 110
3.2.2
Methodology........................................................................................................ 112
3.2.3
CM plug-in for Protégé ........................................................................................ 113
3.2.4
Conclusions and future work. ............................................................................... 115
3.2.5
Acknowledgements............................................................................................... 115
3.2.6
References ........................................................................................................... 115
CHAPTER IV - COGNITIVE SUPPORT FOR AN ARGUMENTATIVE STRUCTURE DURING THE ONTOLOGY DEVELOPMENT PROCESS .......................................... 119 4.1
INTRODUCTION ......................................................................................................... 119
iii
.d o
o
.c
m
C lic
m o
.d o
w
w
w
w
w
C lic
k
to
bu
y
N O
W
!
PD
!
PD
c u-tr a c k
.c
H F-XC A N GE
H F-XC A N GE
c u-tr a c k
W N O y bu to k w
5
4.2
ARGUMENTATIVE STRUCTURE AND CMS .................................................................... 120
4.3
ARGUMENTATION VIA CMS ....................................................................................... 122
4.4
DISCUSSION AND CONCLUSIONS ................................................................................. 126
4.5
REFERENCES ............................................................................................................ 126
CHAPTER V -NARRATIVES AND BIOLOGICAL INVESTIGATIONS .................... 130 5.1
THE USE OF CONCEPT MAPS AND AUTOMATIC TERMINOLOGY EXTRACTION DURING THE DEVELOPMENT OF A DOMAIN ONTOLOGY. LESSONS LEARNT. ........................................
5.1.1
Introduction......................................................................................................... 130
5.1.2
Survey of methodologies....................................................................................... 131
5.1.3
General view of our methodology. ........................................................................ 133
5.1.4
Our scenario and development process ................................................................. 136
5.1.5
Results: GMS baseline ontology............................................................................ 137
5.1.6
Discussion and conclusions .................................................................................. 140
5.1.7
References ........................................................................................................... 142
5.2
6
130
A PROPOSED SEMANTIC FRAMEWORK FOR REPORTING OMICS INVESTIGATIONS. ............. 145
5.2.1
Introduction......................................................................................................... 145
5.2.2
Methodology........................................................................................................ 147
5.2.3
The RSBI Semantic Framework ............................................................................ 148
5.2.4
Conclusions and Future Directions....................................................................... 149
5.2.5
References ........................................................................................................... 150
CHAPTER VI - INFORMATION INTEGRATION IN MOLECULAR BIOSCIENCE 154 6.1
OVERVIEW OF ISSUES AND TECHNOLOGIES.................................................................. 156
6.1.1
Data availability .................................................................................................. 157
6.1.2
Data quality......................................................................................................... 157
6.1.3
Standardisation ................................................................................................... 158
6.1.4
Language ............................................................................................................ 158
6.1.5
Access ................................................................................................................. 159
6.2
STRATEGIES FOR DATA INTEGRATION ......................................................................... 160
6.2.1
Platforms ............................................................................................................ 160
6.2.2
Developments ...................................................................................................... 163
6.2.2.1
Sequence Retrieval System (SRS)................................................................................ 163
6.2.2.2
GeneCards® .............................................................................................................. 165
6.2.2.3
Entrez........................................................................................................................ 165
iv
.d o
o
.c
m
C lic
m o
.d o
w
w
w
w
w
C lic
k
to
bu
y
N O
W
!
PD
!
PD
c u-tr a c k
.c
H F-XC A N GE
H F-XC A N GE
c u-tr a c k
W N O y bu to k w
7
6.2.2.4
Ensembl..................................................................................................................... 165
6.2.2.5
BioMOBY ................................................................................................................. 166
6.2.2.6
myGrid ...................................................................................................................... 167
6.2.2.7
Others........................................................................................................................ 167
6.3
SEMANTIC INTEGRATION OF INFORMATION IN MOLECULAR BIOSCIENCE ...................... 167
6.4
XML AS A DESCRIPTION OF DATA AND INFORMATION.................................................. 173
6.5
GRAPHICAL USER INTERFACES (GUIS) AS INTEGRATIVE ENVIRONMENTS...................... 175
6.6
METABOLIC PATHWAY DATABASES AS AN EXAMPLE OF INTEGRATION......................... 179
6.7
SUMMARY, CONCLUSIONS AND UNSOLVED PROBLEMS................................................ 183
6.8
ACKNOWLEDGMENTS................................................................................................ 187
6.9
REFERENCES ............................................................................................................ 187
CHAPTER VII - WORKFLOWS IN BIOINFORMATICS: META-ANALYSIS AND PROTOTYPE IMPLEMENTATION OF A WORKFLOW GENERATOR .................. 195
8
9
7.1
BACKGROUND .......................................................................................................... 195
7.2
RESULTS .................................................................................................................. 198
7.2.1
Syntactic and algebraic components ..................................................................... 199
7.2.2
Workflow generation, an implementation .............................................................. 205
7.3
ARCHITECTURAL DETAILS......................................................................................... 206
7.4
SEMANTIC AND SYNTACTIC ISSUES ............................................................................. 207
7.5
DISCUSSION.............................................................................................................. 210
7.6
CONCLUSION ............................................................................................................ 214
7.7
ACKNOWLEDGEMENTS .............................................................................................. 214
7.8
REFERENCES ............................................................................................................ 215
CONCLUSIONS AND DISCUSSION............................................................................. 216 8.1
SUMMARY ............................................................................................................... 216
8.2
BIOLOGICAL INFORMATION SYSTEMS AND ONTOLOGIES. ............................................. 217
8.3
TOWARDS A SEMANTIC WEB IN BIOLOGY .................................................................... 220
8.4
DEVELOPING BIO-ONTOLOGIES AS A COMMUNITY EFFORT. ........................................... 224
8.5
REFERENCES ............................................................................................................ 226
FUTURE WORK ............................................................................................................ 228 9.1
BIO-ONTOLOGIES: THE MONTAGUES AND THE CAPULETS, ACT TWO, SCENE TWO: FROM VERONA TO MACONDO VIA LA MANCHA. ................................................................... 228
v
.d o
o
.c
m
C lic
m o
.d o
w
w
w
w
w
C lic
k
to
bu
y
N O
W
!
PD
!
PD
c u-tr a c k
.c
H F-XC A N GE
H F-XC A N GE
c u-tr a c k
W N O y bu to k w
9.1.1
Introduction......................................................................................................... 228
9.1.2
Some background information .............................................................................. 230
9.1.3
The Duels and the duets. ...................................................................................... 231
9.1.4
Marriage, Poison, and Macondo .......................................................................... 233
9.1.5
References ........................................................................................................... 234
APPENDIXES ........................................................................................................................ 236 GLOSSARY ............................................................................................................................
236
ACRONYMS .......................................................................................................................... 240 APPENDIX 1 – RSBI ONTOLOGY .............................................................................................
242
APPENDIX 2 – EXTRACTED TERMINOLOGY ............................................................................. 246 APPENDIX 3 – GMS BASELINE ONTOLOGY (VERSION 1) .......................................................... 258 APPENDIX 4 - GMS BASELINE ONTOLOGY (VERSION 2) ........................................................... 261 APPENDIX 5 – PROTOCOL DEFINITION FILE GENERATED BY G-PIPE ......................................... 266 INDEX .................................................................................................................................. 269
vi
.d o
o
.c
m
C lic
m o
.d o
w
w
w
w
w
C lic
k
to
bu
y
N O
W
!
PD
!
PD
c u-tr a c k
.c
H F-XC A N GE
H F-XC A N GE
c u-tr a c k
W N O y bu to k w
LIST OF FIGURES
Introduction - Figure 1. Controlled vocabularies and ontologies.................................................xvi
Chapter 1 - Figure 1. Uschold and King methodology........................................................................37 Chapter 1 - Figure 2. The TOVE methodology..............................................................................39 Chapter 1 - Figure 3. METHONTOLOGY ...................................................................................42 Chapter 1 - Figure 4. Similarities amongst methodologies. ............................................................47
Chapter 2 - Figure 1. Terminological relationships. ........................................................................58 Chapter 2 - Figure 2. Life cycle, processes, activities, and view of the methodology..................60 Chapter 2 - Figure 3. An incremental evolutionary spiral model of tasks, activities and processes. ................................................................................................................................71 Chapter 2 - Figure 4. Adding a term. ................................................................................................74
Chapter 3 - Figure 1. View of a concept map. .................................................................................89 Chapter 3 - Figure 2. Steps (1-6) and milestones (boxes). ..............................................................89 Chapter 3 - Figure 3. CMs as means to structure a conceptual query. ..........................................97 Chapter 3 - Figure 4. Elicitation of Is_a, whole/part-of, and classes............................................98 Chapter 3 - Figure 5. Methodology, milestones, and phases........................................................112
Chapter 4 - Figure 1. The major concepts of the argumentation ontology and their relations.121
vii
.d o
o
.c
m
C lic
m o
.d o
w
w
w
w
w
C lic
k
to
bu
y
N O
W
!
PD
!
PD
c u-tr a c k
.c
H F-XC A N GE
H F-XC A N GE
c u-tr a c k
W N O y bu to k w
Chapter 4 - Figure 2. A simplification of the argumentative structure presented by Tempich et al.............................................................................................................................................123 Chapter 4 - Figure 3. Biomaterial from MGED ............................................................................125
Chapter 5 - Figure 1. A schematic representation of our process, extending GM. ...................134 Chapter 5 - Figure 2. Classes, instances, and relationships gathered by bringing together extracted terms and previously built ontological models. ...............................................137 Chapter 5 - Figure 3. Narrative, as seen from those concept maps and ontology models domain experts were building. ...........................................................................................139 Chapter 5 - Figure 4. Baseline ontology..........................................................................................140 Chapter 5 - Figure 5. Our methodology. ........................................................................................148 Chapter 5 - Figure 6. A view of a section of the RSBI ontology. ................................................149
Chapter 6 - Figure 1. Schematic representation of the architecture of TAMBIS. .....................172 Chapter 6 - Figure 2. Valine biosynthetic pathway in Escherichia coli .......................................182
Chapter 7 - Figure 1. Syntactic components describing bioinformatics analysis workflows. ...199 Chapter 7 - Figure 2. Syntactic components and algebraic operators. ........................................200 Chapter 7 - Figure 3. Phylogenetic analysis workflow ..................................................................204 Chapter 7 - Figure 4. Case workflow ..............................................................................................205 Chapter 7 - Figure 5. G-PIPE Architecture. ..................................................................................207 Chapter 7 - Figure 6. Designing SNPs............................................................................................208 Chapter 7 - Figure 7. Mapping the RSBI........................................................................................209 viii
.d o
o
.c
m
C lic
m o
.d o
w
w
w
w
w
C lic
k
to
bu
y
N O
W
!
PD
!
PD
c u-tr a c k
.c
H F-XC A N GE
H F-XC A N GE
c u-tr a c k
W N O y bu to k w
Chapter 7 - Figure 8. G-PIPE..........................................................................................................213
Appendix 1 - Figure 1. Identified properties for the RSBI ontology. .........................................243 Appendix 1 - Figure 2. RSBI ontology ...........................................................................................244 Appendix 1 - Figure 3. A concept map for RSBI ontology. ........................................................245
Appendix 3 - Figure 1. A portion of the first version of the GMS ontology, Germplasm. .....258 Appendix 3 - Figure 2. The Germplasm Method section of the first version of the GMS ontology. ...............................................................................................................................259 Appendix 3 - Figure 3. The Germplasm Identifier section of the first version of the GMS ontology. ...............................................................................................................................260
Appendix 4 - Figure 1. Identified properties for the RSBI ontology. .........................................262 Appendix 4 - Figure 2. Genetic Constitution, as understood by the GMS ontology................263 Appendix 4 - Figure 3. Germplasm Breeding Stock, a portion of the second version of the GMS ontology......................................................................................................................263 Appendix 4 - Figure 4. Naming convention according to the second version of the GMS ontology. ...............................................................................................................................263 Appendix 4 - Figure 5. Plant Breeding Method according to the second version of the GMS ontology. ...............................................................................................................................264 Appendix 4 - Figure 6. PlantPropagationProcesses according to the second version of the GMS ontology......................................................................................................................264 Appendix 4 - Figure 7. Some of the parent classes in the RSBI ontology..................................265
ix
.d o
o
.c
m
C lic
m o
.d o
w
w
w
w
w
C lic
k
to
bu
y
N O
W
!
PD
!
PD
c u-tr a c k
.c
H F-XC A N GE
H F-XC A N GE
c u-tr a c k
W N O y bu to k w
LIST OF TABLES
Chapter 1 - Table 1. Summary of methodologies............................................................................46
Chapter 2 - Table 1. A summary of the development process. .....................................................69 Chapter 2 - Table 2. Methodology compliance with IEEE ...........................................................73
Chapter 3 - Table 1. Comparison of methodologies.......................................................................86 Chapter 3 - Table 2. Example of the structure of linguistic definitions. .......................................91 Chapter 3 - Table 3. Examples of competency questions ..............................................................93
Chapter 5 - Table 1. Comparison of methodologies.....................................................................132
Chapter 6 - Table 1. Some existing developments in database integration in molecular biology164 Chapter 6 - Table 2. Some of the most commonly used Graphical User Interfaces (GUIs) for EMBOSS and GCG® ........................................................................................................176
Chapter 7 - Table 1. Algebraic operators........................................................................................201 Chapter 7 - Table 2. Operator specifications. ................................................................................203
x
.d o
o
.c
m
C lic
m o
.d o
w
w
w
w
w
C lic
k
to
bu
y
N O
W
!
PD
!
PD
c u-tr a c k
.c
H F-XC A N GE
H F-XC A N GE
c u-tr a c k
W N O y bu to k w
ACKNOWLEDGMENTS
“Acknowledgments” is usually the part of the thesis in which the author mention those who have participated in the development and evolution of the research work. Expressing gratitude to all those who had any kind of participation in the development of this work is, in my opinion, mandatory. I do certainly thank all of them, for their understanding, consideration, patience, and constant support throughout these, almost four years. However, it is usually the case that some people acquired a more prominent role, and I am reserving this section in order to express in a special way my gratitude for their actions. Firstly, I thank my mother, without whose example and constant support I would ever have found the courage for going through the whole doctoral process. I thank my sister, who taught me an important lesson that helped me to understand the value of family in those times in which I may not fully have appreciated it. My deepest gratitude goes to my entire family, for those obvious things, but most of all for their unconditional love. For having taught me how important it is to have a non-dogmatic, conciliatory attitude, as well as respect and trust for the written word, I would like to express my gratitude to my supervisor, Mark Ragan. The present work would have never been possible without the understanding of those factors that make our work as knowledge engineers so interesting; human factors are also those that make it so hard to represent and formalise knowledge. Is there any piece of knowledge that exists independently from a human being? In my opinion the answer is a straight no. For having helped me to understand this in particular, I would like to specially thank Susana Sansone and Sue Roberthone. For advice on spelling and grammar, I thank Kieran O'Neill. Robert Stevens, Mark Wilkinson, Limsoong Wong, Vladimir Brusic and Kaye Basford are persons for whom I feel a deep gratitude for having understood the importance of my work; but most of all for having had trust in me.
xi
.d o
o
.c
m
C lic
m o
.d o
w
w
w
w
w
C lic
k
to
bu
y
N O
W
!
PD
!
PD
c u-tr a c k
.c
H F-XC A N GE
H F-XC A N GE
c u-tr a c k
W N O y bu to k w
Finally, I am reserving my words to say “thanks”; not so much for the knowledge that we all shared thought these years, but the humanity that allowed us to relate to each other as human beings. Fortunately this research work proved to have a direct impact not only in my understanding of the domain of knowledge, but also, and more importantly, in the importance of those human factors within all of us.
xii
.d o
o
.c
m
C lic
m o
.d o
w
w
w
w
w
C lic
k
to
bu
y
N O
W
!
PD
!
PD
c u-tr a c k
.c
H F-XC A N GE
H F-XC A N GE
c u-tr a c k
W N O y bu to k w
INTRODUCTION
OVERVIEW High-throughput techniques have allowed the production of massive amounts of data in modern biology. Sequencing full genomes is now part of a bigger task, that of identifying the functional regions of genomes, or functional genomics. As modern biology becomes more and more dependent on information technology, it also poses new challenges to computer scientists. Integrative view that functional genomics demands, as it relates information from different sources, may not be fully covered by today’s technology. Answering users’ queries by providing them with an integrated, contextualised view has long ago been considered to be one of the greatest challenges in natural language processing and information retrieval [1]. In order to integrate heterogeneous information effectively, a number of approaches applied to the bio-domain have been studied. Some of these are analyzed in the first chapter of this thesis. Syntactical issues have been resolved, as different standardisation efforts have been launched. However, modern biology still lacks the integrated view that is required. Semantic issues have been identified as extremely important, and consequently the biological community has organised a number of consortia that have taken care of developing biological ontologies. This introductory portion of the thesis is organised as follows. Initially, a brief overview is given. The presentation of those main components and concepts (ontologies and communities) of this thesis is given in pages XVII, and XIX. In these sections the broad problem-space within which this research is situated is illustrated. The next section presents the thesis outline; then, page XXIII exhibits hypothesis and research questions addressed by this investigation. A list of those publications as well as software products that have arisen from this thesis is given in the last section of this introductory chapter.
xiii
.d o
o
.c
m
C lic
m o
.d o
w
w
w
w
w
C lic
k
to
bu
y
N O
W
!
PD
!
PD
c u-tr a c k
.c
H F-XC A N GE
H F-XC A N GE
c u-tr a c k
W N O y bu to k w
WHAT IS AN ONTOLOGY? Definitions for the word ontology vary depending on the field; computer scientists tend to understand the term in a more utilitarian way, whereas philosophers tend to have a more holistic understanding of it. The term “ontology” (Greek on=being, logos=to reason) has its roots in philosophy; it has traditionally been defined as the philosophical study of “what exists”: the study of the kinds of entities in reality, and the relationships that these entities bear to one another [2, 3]. Guarino [4] beautifully summarises the meaning of ontology as being “a branch of metaphysics which deals with the nature and organization of realty”. The meaning of the word ontology in philosophy is “the metaphysical study of the nature of being and existence” [5]. While within the philosophy community there is consensus on the definition for ontology, there is still some dispute amongst members of the artificial intelligence (AI) community. This is partly due to their goal, which is not always to study the nature of “what exists” but how to classify, manage and organise information. For those within the AI community the context in which the ontology is going to be used largely influences the definition of the term. At a glance, an ontology represents a view of the world with the set of concepts and relations amongst them, all of these defined with respect to the domain of interest. For instance, John F. Sowa [6], defines the term as: “The subject of ontology is the study of the categories of things that exist or may exist in some domain. The product of such a study, called an ontology, is a catalog of the types of things that are assumed to exist in a domain of interest D from the perspective of a person who uses a language L for the purpose of talking about D.”
Computer scientists tend to view ontologies as being terminologies with associated axioms and definitions, structured so as to support software applications [7] or in more detail explained by Gruber: “vocabularies of representational terms, classes, relations, functions and object constants with
agreed-upon definitions in the form of human-readable text and machine-enforceable, declarative constraints on their well-formed use.” [8] Even more succinctly, Gruber defines: “An ontology is a formal specification of a conceptualization”.
xiv
.d o
o
.c
m
C lic
m o
.d o
w
w
w
w
w
C lic
k
to
bu
y
N O
W
!
PD
!
PD
c u-tr a c k
.c
H F-XC A N GE
H F-XC A N GE
c u-tr a c k
W N O y bu to k w
In order to understand this definition Gruber et al., as well as Studer et al. agree on the following terminology [9, 10]: • “Conceptualization” is an abstract, simplified model of concepts in the world, usually limited to a particular domain of interest. • “Explicit” indicates that the type of domain concepts and the constraints imposed on their use are explicitly defined. • “Formal” means that the ontology specification must be machine readable Others, such as Neches [11], by contrast consider an ontology to be: “The definition of the basic terms and relations comprising the vocabulary of a topic area, as well as the rules for combining terms and relations to define extensions to the vocabulary.”
Depending on the understanding of conceptualisation and context there are different interpretations for the term “ontology”. Independently from the understanding of these terms it could be said that every ontology model for knowledge representation is either explicitly or implicitly committed to some conceptualisation. As this thesis’s context is that of an information system the definition of an ontology that best serves our purpose is: “An ontology is a non-necessarily complete, formal classification of types of information structured by relationships defined by the vocabulary of the domain of knowledge and by the canonical formulations of its theories”
Guarino and Smith heavily influence this definition. It complies with Guarino in that an ontology is, possibly, an incomplete agreement about a conceptualisation and not a specification of the conceptualisation. Ontologies should therefore be understood as agreements amongst people within a community sharing interest in a common domain. By “incomplete” it is understood that the classification of types of information should be left open for interoperability purposes. By “formal” it is meant that the ontology specification can be easily translated into a machine-readable code, as the ontology should support inference processes within those information systems using it. However, it should be noted, that the latter is not mandatory when defining ontologies on an abstract level.
xv
.d o
o
.c
m
C lic
m o
.d o
w
w
w
w
w
C lic
k
to
bu
y
N O
W
!
PD
!
PD
c u-tr a c k
.c
H F-XC A N GE
H F-XC A N GE
c u-tr a c k
W N O y bu to k w
CONTROLLED VOCABULARIES AND ONTOLOGIES Controlled vocabularies (CVs) are taxonomies of words built upon an is-a hierarchy; as such they are not mean to support any reasoning process. Controlled vocabularies per se describe neither relations among entities nor relations among concepts, and consequently cannot support inference processes. CVs may be part of ontologies when they instantiate classes. As the process of developing ontologies moves forward the hierarchy is formalised not only by means of is-a and part-of, but also other relations are used, as well as logical operators and description logics constructs. Figure 1 illustrates how within the process of developing ontologies CVs play an important role. Within biological sciences ontologies have been understood to be highly related to controlled vocabularies. Gene Ontology (GO) [12] as well as the Microarray Gene Expression Data (MGED) ontology (henceforth MO) [13] [14] have been used primarily to annotate and unify data across biological databases. These controlled vocabularies have evolved over time; the hierarchies upon which they have been built have used two kind of properties, is-a and part-of. Thus an ontology is not simply a controlled vocabulary, nor merely a dictionary of terms. Controlled vocabularies per se describe neither relations among entities nor relations among concepts, and consequently cannot support inference processes.
Introduction - Figure 1. Controlled vocabularies and ontologies.
Independently from the methodology for developing ontologies, controlled vocabularies are used at different stages during the development of the ontology. For xvi
.d o
o
.c
m
C lic
m o
.d o
w
w
w
w
w
C lic
k
to
bu
y
N O
W
!
PD
!
PD
c u-tr a c k
.c
H F-XC A N GE
H F-XC A N GE
c u-tr a c k
W N O y bu to k w
instance, some methodologies suggest the use of lists of lists of words from the beginning as a mean to facilitate the identification of classes [15]. Others, such as Good et al. [16] use lists of words to frame the knowledge elicitation process when developing the ontology. A more indepth analysis on methodologies for developing ontologies is presented in chapter one. WHY ONTOLOGIES? Several authors have extensively discussed the “whys” for ontologies. Within the computer science community these reasons have been summarised [17, 18]. • To clarify and share the structure of knowledge Different information systems might follow different business logics. However, they are considered to be interoperable if they can exchange data and information. Such heterogeneous applications are only able to share information if there is an agreed common vocabulary to describe those items these information systems are meant to manage. • To allow reusing knowledge This is particularly evident within the biological domain. As large domains of knowledge are highly fragmented, communities of experts have developed their own ontologies that should in principle allow others to reuse them whenever needed. Thus the ability to integrate and reuse an existing ontology, without needing to rebuild it, provides a great benefit. Although reuse has been accepted to be one of the major advantages for using ontologies, it is not clear how a merger or integration of ontologies should be carried out. Ontology interoperability has been recognised as a challenging and as yet unachieved task. No current ontology-building methodology really addresses this issue or deals with it explicitly. There is no consensus for the methods used in merging and integration. These are still unclear and more of an art than a methodology [19]. These issues are still part of ongoing research in the area [20, 21]. • To make the assumptions used to create the domain explicit • To allow a clear differentiation between domain knowledge and operational knowledge xvii
.d o
o
.c
m
C lic
m o
.d o
w
w
w
w
w
C lic
k
to
bu
y
N O
W
!
PD
!
PD
c u-tr a c k
.c
H F-XC A N GE
H F-XC A N GE
c u-tr a c k
W N O y bu to k w
Operational knowledge should be here understood as that due to every-day practice. Domain knowledge, by contrast, is the kind of knowledge that allows for creation and generation of complementary discourse. WHY COMMUNITIES? A particularly recurrent and important term throughout this thesis is “community”, and more broadly “community of practice”. Wenger defines communities of practice: “Communities of practice are the basic building blocks of a social learning system because they are the social ‘containers’ of the competences that make up such a system… Communities of practice define competence by combining three elements. First, members are bound together by their collectively developed understanding of what their community is about and they hold each other accountable to this sense of joint enterprise. To be competent is to understand the enterprise well enough to be able to contribute to it. Second, members build their community through mutual engagement. They interact with one another, establishing norms and relationships of mutuality that reflect these interactions. To be competent is to be able to engage with the community and be trusted as a partner in these interactions. Third, communities of practice have produced a shared repertoire of communal resources—language, routines, sensibilities, artefacts, tools, stories, styles, etc. To be competent is to have access to this repertoire and be able to use it appropriately.” [22-24].
Interestingly, Wenger emphasises the “shared repertoire of resources” such as “language, techniques artifacts”; this part of his definition has a remarkable parallel within an apparently unrelated community, knowledge management. Knowledge is defined by Davenport and Prusak as: “Knowledge is a mix of framed experience, values, contextual information, expert insight and grounded intuition that provides an environment and framework for evaluating and incorporating new experiences and information. It originates and is applied in the minds of knowers. In organisations, it often
becomes embedded not only in documents or repositories but also in organisational routines, processes, practices and norms.”[25]
xviii
.d o
o
.c
m
C lic
m o
.d o
w
w
w
w
w
C lic
k
to
bu
y
N O
W
!
PD
!
PD
c u-tr a c k
.c
H F-XC A N GE
H F-XC A N GE
c u-tr a c k
W N O y bu to k w
Davenport and Prusak place emphasis on “organisational routines, processes, practices and norms”. These shared repertoires between these two definitions make it clear that communities of practice are brought together by their intersecting knowledge. Ontologies in bioinformatics have been developed by communities of practices for which there is a common need. For instance the MGED society, which is an international organisation of biologists, computer scientists, and data analysts that aims to facilitate the sharing of microarray data generated by functional genomics and proteomics experiments [14], develops and maintains MO. They have initially focused on establishing standards for microarray data annotation and exchange, facilitating the creation of microarray databases and related software implementing these standards [14]. This does not mean that other omics technologies are not currently being considered. Annotating microarray experiments has been made possible by means of MO as it provides a controlled vocabulary for describing microarray experiments. MO is in principle independent from any software development using it. As microarray investigations can be interpreted only in the context of the experimental conditions under which the samples used in each hybridisation were generated [26], MO makes it possible not only to share but also to better understand the context in which results were generated. Within this context the communities of practice are being brought together by a common need and interest (e.g. the use of a particular kind of technology), as well as by “organizational routines, processes, practices and norms”. Their interaction takes place mostly via electronic means such as wiki-pages, email, concurrent version systems (CVS), and phone conferences. As the goal in life sciences is to make information available and exchangeable in the form of virtual knowledge a more suitable and complementary definition for communities of practice is given by: “Virtual communities of practice are communities of practice (and the social ‘places’ that they collectively create) that rely primarily (though not necessarily exclusively) on networked communication media to communicate, connect, and carry out community activities”
Biological communities are indeed communities of practice; not only do they have their own way to interact (e.g. papers, conferences) but also, and more importantly, no matter how fragmented they have a common vocabulary. Electronic means have facilitated not only the xix
.d o
o
.c
m
C lic
m o
.d o
w
w
w
w
w
C lic
k
to
bu
y
N O
W
!
PD
!
PD
c u-tr a c k
.c
H F-XC A N GE
H F-XC A N GE
c u-tr a c k
W N O y bu to k w
interaction but also the fragmentation of this community; interestingly it has also facilitated the standardisation across the entire domain of knowledge by making explicit the need for a holistic approach. For instance, data entries in GenBank [27] may encode a human Mendelian disease, for which there are both metabolic pathways as well as reported single nucleotide polymorphisms (SNPs) of interest. Such information may be scattered across Gene Cards [28], BioCyc [29] and possibly other databases. Despite the divisions and specialisations of the field, the systems studied by biological sub-communities interact in reality, and it is precisely because of this that this community needs ontologies. As the knowledge is not owned by a particular group, these knowledge should be captured and represented from and by the community [15, 30]. BRINGING IT ALL TOGETHER Communities have been developing ontologies in order to describe those entities that we study, genes, proteins, DNA-binding factors, as well as biomaterial, and technologydependent artefacts. Gene Ontology [31] is an example of an ontology that aims to describe those things biologists study, whereas MGED [32] ontology may be seen as one that aims to describe the process by which we study those “things”. It is by using descriptors provided by both ontologies that accurate representations of research endeavors may be possible. At any given time a toxicological study may use a part of the liver of a rat in order to profile the response of genes to a certain perturbation. In order to describe such an effort, different ontologies are needed, some to describe the biology of the research endeavor (e.g. cells, cellular compartment, animal, organism, etc.) as well as some to describe the techniques used (e.g. microarrays, proteomics, PCR, chromatography, etc.). Some of the required ontologies exist; however they are not always sufficient, nor are they used for annotation in all biological investigations. Different views on the same issue may mean that those mechanisms for involving the biological community in the development of their own ontologies should be improved. The lack of methodologies and software tools supporting these methodologies is a bottleneck in the development of biological ontologies.
xx
.d o
o
.c
m
C lic
m o
.d o
w
w
w
w
w
C lic
k
to
bu
y
N O
W
!
PD
!
PD
c u-tr a c k
.c
H F-XC A N GE
H F-XC A N GE
c u-tr a c k
W N O y bu to k w
This research has analyzed the biological community as well as the intended use of some of those ontologies currently under development. Different scenarios arise from three ontology developments in which the author took part. These cases allowed a careful and exhaustive study of the dynamic and features these kinds of developments have. The nutrigenomics community permitted the author to understand the behavior of communities when developing ontologies, as well as the significance of groupware technology for developing loosely centralised ontologies. From this initial experience it was also possible to identify and illustrate how concept maps could be used to support knowledge elicitation during the development of ontologies. A methodology describing how biological ontologies could be better developed was consequently proposed. Two other scenarios were studied. The Reporting Structure for Biological Investigations (RSBI) case was one that aimed to define the structure and semantics for reporting a biological investigation. Influenced by MIAME, the RSBI working group addressed the issue of investigations in a broader sense; the context was not limited to describing a microarray experiment, but any biological experiment. This experience was interesting not only because of the involvement of three different communities (toxicgenomics, environmental genomics, and nutrigenomics) but most importantly because it was easy to understand how difficult it was to describe an investigation; how could technology be classified in a way that inference is possible within any given Laboratory Information Management System? As a high level container, investigation, what minimal descriptors should accompany it in order to provide an insightful, useful and comprehensive view of the whole investigation? Finally, another ontology was also supported during its development, the Genealogy Management System (GMS) Ontology. The GMS Ontology provided us with a fertile ground in which it was possible to extend the methodology proposed from the nutrigenomics case. Conceptual maps facilitate knowledge elicitation and sharing, but it is not easy to frame the view of the domain experts, sometimes they tented to be quite specific, and some other times quite general. By combining terminology extraction and conceptual mapping it became xxi
possible to constrain the elicitation exercises with domain experts, thus making it possible to capture classes and differentiate them from instances at early stages of the elicitation process. The three previously mentioned ontology developments permitted the study of existing software from the perspective of both users and knowledge engineers in decentralised settings. From these experiences it was also possible to identify the argumentative structure that takes place when developing ontologies.
RESEARCH PROBLEM
This thesis has laid down a series of questions not previously considered when studying biological ontologies: how to develop them, and how to use them when integrating information. Throughout this thesis, integration of information in bioinformatics is studied mainly from the semantic perspective, placing particular attention on the actual process by which the ontology is being developed. Despite this emphasis, other aspects related to integration of information have also been considered. The overall research problem in this thesis is: "The participation of communities in the development of biological ontologies poses challenges not previously considered by existing methodologies for developing ontologies." To address this problem, the author presents a series of hypotheses and questions. These seek to explore and analyze methodological and practical challenges in developing ontologies within the bioscience community.
1. If ontologies are to be developed by communities then the ontology development life cycle should be better understood within this context.
2. When eliciting knowledge for developing an ontology on a community basis there is an increasing need not only to support the process as such, but also to facilitate the communication, structure and exchange of information amongst the participants of the process.
3. If biological investigations bring together a mix of disciplines then the descriptions for such research endeavours should encompass all those different views.
4. How should a well-engineered methodology facilitate the development of ontologies within communities of practice? What methodology should be used?
By answering these questions this doctoral work addresses the ontology development, information integration and study description problems in modern biology, and proposes different methods to facilitate information integration across various information systems. Initial chapters focus on providing answers to the main research question in this thesis. By investigating existing methodologies and analyzing them within the context of biological communities of practice it was possible to propose a methodology and understand the life cycle of these ontologies. Those experiences that allowed the author to gather information, test and improve the methodology are presented in chapters 3, 4, and 5. As it was important for the successful conclusion of this thesis to constantly test those research outcomes, a simple yet quite illustrative scenario was laid down. This scenario (chapter 7) allowed the author to study not just the development of ontologies, but also the use of ontologies by software layers within an integrative environment that also had a community of users.
CONTRIBUTIONS OF THIS THESIS
The four main contributions of this doctoral work are:
• Improving our understanding of the role of ontologies in the domain of bioinformatics and laboratory information management,
• Developing a way to engineer ontologies within this domain,
• Developing several ontologies of substantial complexity and of practical use in real applications, and
• Developing a workflow system of substantial sophistication for which syntactical and semantic aspects are easily observable and manageable.
Throughout the development of this thesis work special emphasis was placed on studying cases for which this work could have a direct impact. The search for and interest in real scenarios allowed me to collaborate extensively with other groups such as EBI (European Bioinformatics Institute), the Pasteur Institute, CGIAR (Consultative Group on International Agricultural Research) and ACPFG (Australian Centre for Plant Functional Genomics). It also gave us an
additional reason to publish our work; this active communication, via different means, enabled us to receive relatively rapid feedback regarding our work. The combination of collaborations and publications enriched this work, but more importantly permitted the rapid use and testing of the intermediary products of this work. A list of research outcomes and outputs of this thesis is given below.
OUTLINE OF THIS THESIS
This thesis begins by addressing the problem of information integration, and examines the syntactic and semantic factors that should be taken care of when describing experiments. A particular task was always considered as crucial throughout the development of this thesis: intelligent information retrieval. Attention was focused on semantic issues associated with information integration: How could the reproducibility of biological experiments be ensured? How could experiments be effectively shared? How could ontologies be built while ensuring the participation of a wider community? Special attention was given to the involvement of the community when developing ontologies. The agreement is critical as it, to some extent, assures the use and, to some degree, the correctness of the ontology. This thesis is organised into a series of chapters that address aspects related to semantic issues in the integration of information in bioinformatics. Chapter I, "Communities at the melting point when building ontologies", is a critical analysis of existing methodologies for developing ontologies; not only are existing methodologies presented, but it is also analyzed how these methodologies could be used within the biological domain, as well as which issues should be considered in order to propose a new methodology. Chapter II, "The melting point, a methodology for developing ontologies within decentralised settings", presents a novel methodology that has been engineered upon cases extracted from real scenarios. In principle this methodology may be used not only within the bio domain but also in other contexts. Chapter III, "The use of concept maps during knowledge elicitation in ontology development processes", presents the development of biological ontologies, factors associated with this process, as well as a
process that was followed. Chapter IV presents how cognitive support may be provided by means of concept maps during the argumentative process that takes place when developing ontologies. For this particular task we used two unrelated scenarios: Reporting Structure for Biological Investigations (RSBI) and Genealogy Management Systems (GMS). It is important to notice that both Chapters III and IV proved to be a fertile playground in which the methodology presented in Chapter II was engineered; these two experiences served, for the development of this thesis, as experiments from which valuable information was gathered. Chapter V presents a literature review in which different approaches to the integration of molecular data, as well as analytical tools, are analyzed; this chapter aims to facilitate the transition into Chapter VI, in which a different scenario, extracted mostly from in silico biology, is studied from both syntactic and semantic perspectives. Interestingly, during the development of this part of the doctoral work it became possible to understand better how syntactically based solutions, despite being workable tools, still lack some important features that only the correct use of ontologies could provide. As in silico experiments are also valid examples of biological investigations, another important outcome from Chapter V was the actual practical use of the ontology proposed in Chapter IV. Discussions, conclusions and future work are presented in the remaining chapters of this thesis. In part, this was done by using literary analogies, mostly with Shakespeare's masterpiece "Romeo and Juliet" and also with "One Hundred Years of Solitude" by Garcia Marquez. These analogies seemed ideal because they illustrate what, in my opinion, constitutes a central problem in the development of biological ontologies, and more broadly in the development of information systems, namely interdisciplinary work. The relationship between bioinformatics and the semantic web is used as an introduction to the rest of the discussions and conclusions. Chapter VII presents some future work, using the literary analogies mentioned here.
PUBLISHED PAPERS
1. Garcia Castro A, Sansone AS, Rocca-Serra P, Taylor C, Ragan MA: The use of conceptual maps for two ontology developments: nutrigenomics, and a management system for genealogies. In: 8th International Protégé Conference: 2005; Madrid, Spain; 2005: 59-62.
2. Garcia Castro A, Chen Y-PP, Ragan MA: Information integration in molecular bioscience: a review. Applied Bioinformatics 2005, 4(3):157-173.
3. Garcia Castro A, Chen Y-PP, Ragan MA: Workflows in bioinformatics: meta-analysis and prototype implementation of a workflow generator. BMC Bioinformatics 2005, 6:87.
4. Garcia Castro A, Thoraval S, Garcia Castro L-J, Ragan MA: G-PIPE, an implementation. In: NETTAB: 2005; Naples, Italy; 2005.
5. Garcia Castro A, Sansone AS, Taylor CF, Rocca-Serra P: A conceptual framework for describing biological investigations. In: NETTAB: 2005; Naples, Italy; 2005.
6. Garcia Castro A, Rocca-Serra P, Stevens R, Taylor C, Nashar K, Ragan MA, Sansone S: The use of concept maps during knowledge elicitation in ontology development processes - the nutrigenomics use case. BMC Bioinformatics 2006, 7:267.
7. Garcia Castro A: Cognitive support for an argumentative structure during the ontology development process. In: 9th International Protégé Conference: July 2006; Stanford, CA, USA; 2006.
8. Garcia Castro A: The Montagues and the Capulets, act two, scene two: from Verona to Macondo via La Mancha. Submitted for publication.
9. Fostel J, Choi D, Zwickl C, Morrison N, Rashid A, Hasan A, Bao W, Richard A, Tong W, Garcia Castro A, Bushel P et al: Chemical effects in biological systems - data dictionary (CEBS-DD): a compendium of terms for the capture and integration of biological study design description, conventional phenotypes and omics data. Toxicological Sciences 2005, 88(2):585-601.
10. O'Neill K, Schwegmann A, Jimenez R, Jacobson D, Garcia Castro A: OntoDas - integrating DAS with ontology-based queries. In: Bio-ontologies SIG, ISMB 2007. Vienna, Austria; 2007.
SOFTWARE DEVELOPED, INCLUDING ONTOLOGIES
1. G-PIPE: a workflow generator for PISE: http://if-web1.imb.uq.edu.au/Pise/gpipe.html
2. Reporting Structure for Biological Investigations (RSBI), see Appendix 1
3. An ontology for Genealogy Management Systems, available at: http://cropwiki.irri.org/icis/index.php/ICIS_Domain_Models, see Appendix 3 and 4
4. Conceptual mapping plug-in for Protégé, available at: http://if-web1.imb.uq.edu.au/plug-in.html
REFERENCES
1. Chagoyen-Quiles M: Integration of biological data: systems, infrastructures and programmable tools. Doctoral Thesis. Madrid: Universidad Autonoma de Madrid, Escuela Politecnica Superior; 2005.
2. Smith B, Ceusters W: Ontologies as the core discipline of biomedical informatics, legacies of the past and recommendations for the future. In: Computing, Philosophy, and Cognitive Sciences. Edited by Crnkovic GD, Stuart S. Cambridge: Cambridge Scholars Press; 2006.
3. Smith B: Ontology. In: Guide to Philosophy of Computing and Information. Edited by Floridi L. Oxford: Blackwell; 2004: 155-166.
4. Guarino N, Giaretta P: Ontologies and Knowledge Bases: Toward a Terminological Clarification. Towards Very Large Knowledge Bases 1995, In N.J.I. Mars (ed.):25-32.
5. WordNet. In: http://wordnet.princeton.edu/. 2007.
6. Sowa JF: Knowledge Representation: Logical, Philosophical, and Computational Foundations. Pacific Grove, CA: Brooks Cole Publishing Co; 2000.
7. Smith B, Williams J, Schulze-Kremer S: The ontology of the gene ontology. In: AMIA Annual Symposium: 2003; 2003: 609-613.
8. Gruber T: The role of knowledge representation in achieving sharable, reusable knowledge bases. In: Second International Conference on Principles of Knowledge Representation and Reasoning. Cambridge, MA; 1991.
9. Gruber TR: Toward principles for the design of ontologies used for knowledge sharing. International Journal of Human-Computer Studies 1995, 43(5-6):907-928.
10. Studer R, Benjamins VR, Fensel D: Knowledge Engineering: Principles and methods. Data & Knowledge Engineering 1998, 25(1-2):161-197.
11. Neches R, Finin R, Gruber T, Patil R, Senator T, Swartout WR: Enabling Technology for Knowledge Sharing. AI Magazine 1991:36-55.
12. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT et al: Gene ontology: tool for the unification of biology. Nat Genet 2000, 25(1):25-29.
13. Stoeckert CJ, Parkinson H: The MGED ontology: a framework for describing functional genomics experiments. Comparative and Functional Genomics 2003, 4(1):127-132.
14. Microarray Gene Expression Data [http://www.mged.org/]
15. Garcia CA, Rocca-Serra P, Stevens R, Taylor C, Nashar K, Ragan MA, Sansone S: The use of concept maps during knowledge elicitation in ontology development processes - the nutrigenomics use case. BMC Bioinformatics 2006, 7:267.
16. Good B, Tranfield EM, Tan PC, Shehata M, Singhera GK, Gosselink J, Okon EB, Wilkinson M: Fast, Cheap, and Out of Control: A Zero Curation Model for Ontology Development. In: Pacific Symposium on Biocomputing: 2006; 2006.
17. Guarino N: Understanding, building and using ontologies. International Journal of Human-Computer Studies 1997, 46(2-3):293-310.
18. Noy NF, McGuinness DL: Ontology Development 101: A Guide to Creating Your First Ontology. Stanford, CA: Stanford University; 2001.
19. Mirzaee V: An Ontological Approach to Representing Historical Knowledge. MSc Thesis. Vancouver: Department of Electrical and Computer Engineering, University of British Columbia; 2004.
20. Beck H, Pinto HS: Overview of Approach, Methodologies, Standards, and Tools for Ontologies. The Agricultural Ontology Service (UN FAO); 2003.
21. Pinto S, Perez AG, Martins JP: Some issues on ontology integration. In: Workshop on Ontologies and Problem-Solving Methods: Lessons Learned and Future Trends (IJCAI99): 1999; Stockholm, Sweden; 1999.
22. Wenger E, McDermott R, Snyder S: Cultivating communities of practice. A guide to managing knowledge. Boston: Harvard Business School Press; 2002.
23. Wenger E: Communities of practice. Learning, meaning, and identity. Cambridge, UK: Cambridge University Press; 1998.
24. Wenger E: Communities of practice and social learning systems. Organization 2000, 7(2).
25. Davenport TH, Prusak L: How organizations manage what they know. Boston, Massachusetts: Harvard Business School Press; 1998.
26. Rayner TF, Rocca-Serra P, Spellman PT, Causton HC, Brazma A: A simple spreadsheet-based, MIAME-supportive format for microarray data: MAGE-TAB. BMC Bioinformatics 2006, 7:489.
27. Benson DA, Boguski MS, Lipman DJ, Ostell J, Ouellette BFF, Rapp BA, Wheeler DL: GenBank. Nucleic Acids Res 1999, 27(1):12-17.
28. Rebhan M, Chalifa-Caspi V, Prilusky J, Lancet D: GeneCards: a novel functional genomics compendium with automated data mining and query reformulation support. Bioinformatics 1998, 14(8):656-664.
29. Karp PD, Riley M, Paley SM, Pellegrini-Toole A: The MetaCyc database. Nucleic Acids Res 2002, 30(1):59-61.
30. Pinto HS, Staab S, Tempich C: Diligent: towards a fine-grained methodology for Distributed, Loosely-controlled and evolving engineering of ontologies. In: European Conference on Artificial Intelligence: 2004; Valencia, Spain; 2004: 393-397.
31. Ashburner M, Ball C, Blake J, Botstein D, Butler H, Cherry J, Davis A, Dolinski K, Dwight S, Eppig J et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genetics 2000, 25(1):25-29.
32. Stoeckert CJ, Parkinson H: The MGED ontology: a framework for describing functional genomics experiments. Comparative and Functional Genomics 2003, 4(1):127-132.
Communities at the melting point when building ontologies
Although several methodologies have been proposed addressing the problem of building ontologies, ontology engineering does not have a standard methodology. There is an ongoing debate amongst those in the ontology community about the best methodology to build them. Several groups have engineered particular methodologies to solve their specific problems. Some may have been more interested in using the ontology than in how it was built. Others have proposed methodologies without having a specific problem to solve; the methodology itself was the main purpose of their research. These proposed methodologies differ in their stages, steps, methods and techniques; all of them have been conceptualised for scenarios in which domain experts are in one place. None of them explicitly addresses the problem of decentralised settings; furthermore, none of them specifically targets domains, such as the biological one, for which domain experts are at the same time designers and users of the technology. From the investigated methodologies it was possible to identify those stages, steps, methods and techniques that were common and, in principle, applicable when developing ontologies within the biological domain. This chapter presents a detailed analysis of these methodologies in order to have a unified comparison criterion by which it was possible to do an in-depth analysis that facilitated the identification of reusable components, shortcomings and strong points in the studied methodologies. Terms such as knowledge engineer, method, domain expert, domain ontology, and many others are explained here, and descriptions of previously proposed methodologies are provided. The author conceived the project, and identified those key issues elaborated here. The manuscript was entirely written by Alex Garcia Castro.
1 Chapter I - Communities at the melting point when building ontologies
1.1 INTRODUCTION
Building well-developed ontologies represents an important and difficult challenge as ontology engineering is still in its infancy. It is precisely the availability of standard and broadly applicable methodologies in a particular discipline which represents its "adulthood" stage [1]. Currently ontology engineering has no standard methodology for developing ontologies [2, 3]; there is an ongoing debate amongst those in the ontology community about the best methodology to build them [4-6]. Several groups have engineered particular methodologies to solve their specific problems. Some may have been more interested in using the ontology than in how it was built. Others have proposed methodologies without having a specific problem to solve; the methodology itself was the main purpose of their research. Most of the literature focuses on issues such as the suitability of particular tools and languages for building ontologies, with little attention being given to how ontologies should be built. This is almost certainly because the main interest has been in reporting content and use, rather than in engineering methodologies [7]. Biologists have been building classification systems since before Linnaeus. In the past, biologists have understood classification systems as systems that allow them to identify, name, and group organisms according to predefined criteria. This makes it possible for the community as a whole to be sure they know the exact organism that is being examined and discussed. More recently, the biological community has started to classify genes and gene products; with this need in mind the Gene Ontology (GO) was created. The involvement of the community has played a major role since the foundation of the GO consortium as it is a collaborative effort that addresses the need for consistent descriptions of gene products in different databases [8]. Initially GO provided a controlled vocabulary only for model organism databases such as FlyBase (Drosophila) [9], the Saccharomyces Genome Database
(SGD) [10] and the Mouse Genome Database [11]. It has since been adopted as the de facto standard ontology for describing genes and gene products. The Plant Ontology (PO) [12] also illustrates a biological ontology for which communities are central to its development. The Plant Ontology Consortium (POC) (www.plantontology.org) is a collaborative effort that brings together several plant database administrators, curators and experts in plant systematics, botany and genomics. A primary goal of the POC is to develop simple yet robust and extensible controlled vocabularies that accurately reflect the biology of plant structures and developmental stages. These vocabularies form a network, linked by relationships, thus facilitating the construction and execution of queries that cut across datasets within a database or between multiple databases [12]. The developers of both GO and PO focus on providing controlled vocabularies, facilitating cross-database queries, and having strong community involvement. Despite these efforts, bio-ontologies still tend to be built on an ad hoc basis rather than by following well-defined engineering processes. To this day, no standard methodology for building biological ontologies has been agreed upon. The "hacking" process usually involves gathering terminology and organizing it into a taxonomy, from which key concepts are identified and related to create a concrete ontology. Case studies have been described for the development of ontologies in diverse domains, although surprisingly only one of these has been reported to have been applied in a domain allied to bioscience – the chemical ontology [13] – and none in bioscience per se. The actual "how to build the ontology" has not been the main research focus for the bio-ontological community [7]. This chapter presents a description of the previously proposed methodologies in section two. A summary of the comparison, as well as discussion and conclusions, is presented in section three.
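Before moving on, a hedged illustration of the cross-database querying just described may be useful. In the sketch below (written for this chapter; the gene names and URIs are invented, and the GO identifier is used only as an example), two tiny RDF graphs stand in for two independent databases; because both annotate gene products with the same GO term, a single SPARQL query spans them once they are merged.

```python
# A toy sketch of a query that "cuts across datasets" through a shared
# controlled vocabulary. Two rdflib graphs simulate two databases.
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")
GO = Namespace("http://purl.org/obo/owl/GO#")  # assumed namespace for GO terms

flybase = Graph()   # stand-in for a FlyBase-like resource
flybase.add((EX.geneA, EX.hasAnnotation, GO["GO_0006915"]))  # apoptotic process

sgd = Graph()       # stand-in for an SGD-like resource
sgd.add((EX.geneB, EX.hasAnnotation, GO["GO_0006915"]))

merged = flybase + sgd  # rdflib graphs support set union with '+'
q = """
SELECT ?gene WHERE {
  ?gene <http://example.org/hasAnnotation> <http://purl.org/obo/owl/GO#GO_0006915> .
}
"""
for row in merged.query(q):
    print(row.gene)  # both geneA and geneB, found via the shared GO term
```

The point of the sketch is only that the shared identifier, not any shared schema, is what makes the two resources jointly queryable.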
1.2 METHODS AND METHODOLOGIES FOR BUILDING ONTOLOGIES
Several approaches have been reported for developing ontologies; some of them provide insights for developing de novo ontologies, whereas others pay more attention to extending, transforming and re-using existing ontologies. Independently of the focus, both methods and methodologies have not yet been standardised. Not only are there several different methodologies, but there are also numerous software tools aiming to assist knowledge engineers during the process. A methodology is a "comprehensive integrated series of techniques or methods creating a general system theory of how a class of thought-intensive work ought to be performed" [14]. Methodologies are composed of both techniques and methods. A method is an "orderly" process or procedure used in the engineering of a product or performing a service [15]. A technique is a "technical and managerial procedure used to achieve a given objective" [14]. Methodologies bring together techniques and methods in an orchestrated way so that the work can be done. From the experiences reported by Garcia et al. [7] as well as by Pinto et al. [16], the knowledge engineer is understood as a person who applies knowledge engineering techniques to transfer human knowledge into artificially intelligent systems, not only by modelling the knowledge and problem-solving techniques of the domain expert into the system but also by promoting collaboration amongst domain experts. This definition is also influenced by [17, 18]. Several approaches are analyzed here. Strong points and shortcomings are reviewed according to the following criteria (C), heavily influenced by the work done by Fernandez [1], Mirzaee [3] and Corcho et al. [19].
C1. Inheritance from knowledge engineering. As most ontology building methodologies are inspired by work done in the field of Knowledge Engineering (KE) to create methodologies for developing knowledge based systems (KBS), this criterion considers the influence traditional KE has had on the studied methodologies.
C2. Detail of the methodology. This criterion is used to assess the clarity with which the methodology specifies the orchestration of methods and techniques.
C3. Strategy for building the ontology. This should provide information about the purpose of the ontology, as well as the availability of domain experts. There are three main strategic lines to consider: i) how tightly coupled the ontology is going to be in relation to the application that should in principle use it; ii) the kinds of domain experts available; iii) the kind of ontology to be developed. These matters are better explained in C3a to C3i.
C3a. Application-dependent: The ontology is built on the basis of an application knowledge base, by means of a process of abstraction [1].
C3b. Application-semidependent: Possible scenarios of ontology use are identified in the specification stage [1].
C3c. Application-independent: The process is totally independent of the uses to which the ontology will be put in knowledge-based systems, agents, etc.
C3d. Specialised domain experts: Both C3d and C3e have to do with the kind of domain experts who are available and willing to participate in the development process. This influences C4. Specialised domain experts are those with an in-depth knowledge of their field. Within the biological context these are usually researchers with vast laboratory experience, very focused and narrow within their domain of knowledge. The ontology is built from very specific concepts; this is also known as a bottom-up approach.
C3e. Broader-knowledge domain experts: Broader-knowledge domain experts are those who tend to have a broader picture. Having this kind of domain expert usually facilitates capturing concepts more related to high-level abstractions and general processes, rather than the specific vocabulary describing those processes. The ontology may be built from high-level abstractions downwards to specifics. This facilitates the approach known as top-down.
C3f. Top-level ontologies: These describe very general concepts like space, time, and event, which are independent of a particular problem domain. Such unified top-level ontologies aim
at serving large communities [20]. These ontologies are also known as foundational ontologies; see for instance [21].
C3g. Domain ontologies: These describe domain-specific vocabulary.
C3h. Task ontologies: These describe vocabulary related to tasks, processes, or activities.
C3i. Application ontologies: As Sure [20] describes them, application ontologies are specialisations of domain and task ontologies, as they form a base for implementing applications with a concrete domain and scope.
C4. Strategy for identifying concepts. As has been previously mentioned in C3d and C3e, there are two strategies regarding the construction of the ontology and the kinds of terms it is possible to capture [22]: the first is to work from the most concrete to the most abstract (bottom-up), whereas the second is to work from the most abstract to the most concrete (top-down). An alternative route is to work from the most relevant to the most abstract and most concrete (middle-out) [1, 22, 23].
C5. Recommended life cycle. Analysis of whether the methodology implicitly or explicitly proposes a life cycle [1].
C6. Recommended methods and techniques. This criterion evaluates whether or not there are methods and techniques as part of the methodology. This is closely related to C2. An important issue to be considered is the availability of software supporting either the entire methodology or a particular method within the methodology. This criterion also deals with the methods or software tools available within the methodology for representing the ontology, whether these be OWL (Web Ontology Language), frames, RDF (Resource Description Framework), etc.; a small illustrative snippet is given after the list of methodologies below.
C7. Applicability. As knowledge engineering is still in its infancy it is important to evaluate the methodology in the context of those ontologies for which it has been used.
C8. Community involvement. As has been pointed out before in this thesis (see chapter one), it is important to know the level of involvement of the community. Phrasing
this as a question: is the community a consumer of the ontology, or is the community taking an active role in its development?
C9. Knowledge elicitation. As has been pointed out by [24], knowledge elicitation is a major bottleneck when representing knowledge. It is therefore important to know whether the methodology assumes it to be an integral part of the process.
The methodologies reviewed under the above criteria are:
• The Enterprise Methodology proposed by Uschold and King [25]
• The TOVE Methodology proposed by Gruninger and Fox [26]
• The Bernaras methodology proposed by Bernaras et al. [27]
• The METHONTOLOGY methodology proposed by Fernandez et al. [2]
• The SENSUS methodology proposed by Swartout et al. [28]
• The DILIGENT methodology proposed by Pinto et al. [16, 29]
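The snippet announced under criterion C6 follows. It is a generic sketch of our own showing what "representing the ontology" can look like for the OWL/RDF option: a small class hierarchy and an object property expressed as triples with the rdflib library. The class and property names are invented, and no reviewed methodology prescribes this particular tool or encoding.

```python
# A minimal sketch, under assumed names, of OWL axioms as RDF triples.
from rdflib import Graph, Namespace, OWL, RDF, RDFS

EX = Namespace("http://example.org/bio#")  # hypothetical namespace
g = Graph()
g.bind("ex", EX)

# Two OWL classes and a subsumption axiom ...
g.add((EX.Organ, RDF.type, OWL.Class))
g.add((EX.Liver, RDF.type, OWL.Class))
g.add((EX.Liver, RDFS.subClassOf, EX.Organ))

# ... and an object property whose values are organs.
g.add((EX.derivedFrom, RDF.type, OWL.ObjectProperty))
g.add((EX.derivedFrom, RDFS.range, EX.Organ))

print(g.serialize(format="turtle"))
```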
1.2.1 The Enterprise Methodology
Uschold and King propose a set of four activities:
• Identify the purpose and scope of the ontology
• Build the ontology, for which they specify three activities:
  o Knowledge capture
  o Development (coding)
  o Integrating with other ontologies
• Evaluate
• Document the ontology
Chapter 1 - Figure 1. Uschold and King methodology.
C1. The methodology does not especially inherit methods from knowledge engineering. Although Uschold and King identify steps that are in principle related to some methodologies from knowledge engineering, the authors do not comply with some of the principles in the field. Neither a feasibility study nor a prototype method is proposed.
C2. Stages are identified, but no detail is provided. In particular the "Ontology Coding", "Integration" and "Evaluation" sections are presented in a superficial manner [3].
C3. Very little information is provided. The proposed method is application-independent and very general; in principle it is applicable to other domains. The authors do not present information about the kind of domain experts they advise working with.
C4. For Uschold and King the disadvantage of using the top-down approach is that by starting with a few general concepts there may be some ambiguity in the final product. Alternatively, with the bottom-up approach too much detail may be provided, and not all of this detail could be used in the final version of the ontology [22]. This in principle favors the middle-out approach proposed by Lakoff [23]. The middle-out approach is not only conceived as a middle path between bottom-up and top-down, but also relies on the understanding that categories are not simply organised in hierarchies from the most general to the most specific, but are rather organised cognitively in such a way that categories are located in the middle of
the general-to-specific hierarchy. Going up from this level is generalisation and going down is specialisation [3, 23].
C5. No life cycle is recommended.
C6. No techniques or methods are recommended. The authors mention the importance of representing the captured knowledge but do not make explicit recommendations as to which knowledge representation formalism to use. This methodology does not support any particular software as a development tool. The integration with other ontologies is not described, nor is any method recommended to overcome this issue, nor is it explained whether this integration involves extending the generated ontology or merging it with an existing one.
C7. The methodology was used to generate the Enterprise ontology [30].
C8. Communities are not involved in this methodology.
C9. For those activities specified within the building stage the authors do not propose any specific method for representing the ontology (e.g. frames, description logic, etc.). The authors place special emphasis on knowledge elicitation; however, they are not specific in developing this further.
1.2.2 The TOVE Methodology
The Toronto Virtual Enterprise (TOVE) methodology involves building a logical
model of the knowledge that is to be specified by means of an ontology. The steps involved, as well as their corresponding outcomes, are illustrated in Figure 2.
C1. Gruninger and Fox propose a methodology which is heavily influenced by the development of knowledge-based systems using first-order logic [19].
C2. Gruninger and Fox do not provide specifics on the activities involved.
C3. The authors emphasise competency questions as well as motivating scenarios as important components in their methodology. This methodology is application-semidependent, as specific terminology is used not only to formalise questions but also to build the
completeness theorems used to evaluate the ontology. Once the competency questions have been formally stated, the conditions under which the solutions to the questions must be defined should be formalised. The authors do not present information about the kind of domain experts they advise working with.
C4. This methodology adopts a middle-out strategy.
C5. No indication about a life cycle is given.
C6. Although Gruninger and Fox emphasised the importance of competency questions, they do not provide techniques or methods to approach this problem.
C7. The Toronto Virtual Enterprise ontology was built using this methodology.
C8. Communities are not involved in this methodology.
C9. No particular indication for eliciting knowledge is given.
Chapter 1 - Figure 2. The TOVE methodology.
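Although TOVE formalises competency questions in first-order logic, a hedged way to make the idea tangible is to phrase one as a query over a toy enterprise ontology, as in the sketch below. Everything in it (the URIs, the classes, the facts, and the question itself) is invented for illustration and is not taken from TOVE.

```python
# An informal stand-in for a TOVE-style competency question: "which
# activities consume a given resource?", asked of a toy ontology.
from rdflib import Graph, Namespace, RDF

EX = Namespace("http://example.org/tove#")  # hypothetical namespace
g = Graph()
g.add((EX.Assembly, RDF.type, EX.Activity))
g.add((EX.Assembly, EX.consumes, EX.SteelPlate))
g.add((EX.Painting, RDF.type, EX.Activity))
g.add((EX.Painting, EX.consumes, EX.Paint))

# Competency question: which activities consume SteelPlate?
q = """
SELECT ?activity WHERE {
  ?activity a <http://example.org/tove#Activity> ;
            <http://example.org/tove#consumes> <http://example.org/tove#SteelPlate> .
}
"""
for row in g.query(q):
    print(row.activity)  # the ontology "answers" the question; here, Assembly
```

In TOVE's terms, an ontology is adequate only if its axioms allow such questions to be answered; the query above is merely an executable caricature of that evaluation idea.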
1.2.3 The Bernaras methodology
C1. Bernaras' work was developed as part of the KACTUS [27] project, which aimed to
investigate the feasibility of knowledge reuse in technical systems. This methodology is thus heavily influenced by knowledge engineering.
C2. The original paper by Bernaras et al. provides little detail about the methodology.
C3. This methodology is application-dependent. As the development of this methodology took place within a larger engineering effort, ontologies were being developed hand-in-hand with the corresponding software. This implies that domain experts were being used for both tasks: for requirements interviews and studies as well as for ontology development. This, however, does not mean that domain experts were taking an active role. The authors present very little information about the kind of domain experts they advise working with.
C4. This methodology adopts a bottom-up approach [19].
C5. As the ontology is highly coupled with the software that uses it, the life cycle of the ontology is the same as the software life cycle.
C6. For the specific development of the ontology no particular methods or techniques are provided. However, as this methodology was meant to support the development of an ontology at the same time as the software, it is reasonable to assume that some software engineering methods and techniques were also applied to the development of the ontology.
C7. It has been applied within the electrical engineering domain.
C8. Communities are not involved in this methodology.
C9. No particular indication for knowledge elicitation is provided.
1.2.4 The METHONTOLOGY methodology
C1. METHONTOLOGY has its roots in knowledge engineering. The authors aim to
define a standardisation of the ontology life cycle (development) with respect to the requirements of the Software Development Process (IEEE 1074-1995 standard) [3].
C2. Detail is provided for the ontology development process; Figure 3 illustrates the methodology. It includes the identification of the ontology development process, a life cycle based on evolving prototypes, and particular techniques to carry out each activity [19]. This methodology relies heavily on the IEEE software development process as described in [14]. Gomez-Perez et al. consider that all the activities carried out in an ontology development process may be classified into one of the following three categories:
• Management activities: including planning, control and quality assurance. Planning activities are those aiming to identify tasks, time and resources.
• Development activities: including the specification of the states, conceptualisation, formalisation, implementation and maintenance. Through those activities related to the specification, knowledge engineers should understand the context in which the ontology will be used. Conceptualisation activities are mostly those activities in which different models are built. During the formalisation phase the conceptual model is transformed into a semi-computable model. Finally, the ontology is updated and corrected during the maintenance phase [31].
• Support activities: these include knowledge elicitation, evaluation, integration, documentation, and configuration management.
Chapter 1 - Figure 3. METHONTOLOGY with permission from [19]
C3. Application-independent. No indication is provided as to the kind of domain experts they advise working with. In principle METHONTOLOGY could be applied to the development of any kind of ontology.
C4. This methodology adopts a middle-out approach.
C5. METHONTOLOGY adopts an evolving-prototype life cycle.
C6. No methods or techniques are recommended. METHONTOLOGY relies heavily on WebODE [32] as the software tool for coding the ontology. However, this methodology is in principle independent from the software tool.
C7. This methodology has been used in the development of the Chemical OntoAgent [33] as well as in the development of the Onto2Agent ontology [33].
C8. No community involvement is considered.
C9. Knowledge elicitation is part of the methodology; however, no indication is provided as to which method to use.
1.2.5 The SENSUS methodology
The SENSUS-based methodology [28] is a methodology built upon the
experiences gathered from building the SENSUS ontology. SENSUS is an extension and reorganisation of WordNet [34]; this 70,000-node terminology taxonomy may be used as a framework into which additional knowledge can be placed [35]. SENSUS emphasises merging pre-existing ontologies, and mining other sources such as dictionaries.
C1. SENSUS is not influenced by knowledge engineering, as this methodology mostly relies on methods and techniques from text mining.
C2. Although there is extensive documentation for the text mining techniques and for developing structures for conceptual machine translation [36-38], no detail is provided as to "how" to build the ontology.
C3. As SENSUS makes extensive use of both text mining and conceptual machine translation, the methodology as such is application-semidependent. The methods and techniques proposed by SENSUS may, in principle, be applied to several domains.
C4. SENSUS follows a bottom-up approach. Initially instances are gathered; as the process moves forward, abstractions are then identified.
C5. No life cycle is identified; from the reported experiences the ontology is deployed on a one-off basis.
C6. Methods and techniques are identified for gathering instances. However, no further detail is provided.
C7. SENSUS was the methodology followed for the development of knowledge-based applications for the air campaign planning ontology [39].
C8. No community involvement is considered.
C9. Knowledge elicitation is not considered explicitly.
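The terminology-mining flavour of SENSUS can be made concrete with a deliberately naive sketch of our own (stdlib-only; real systems rely on proper NLP pipelines and large corpora): frequent word bigrams in a snippet of domain text are ranked as candidate terms. The sample text is invented.

```python
# A toy term extractor: rank frequent, stopword-free bigrams as
# candidate domain terms. Illustration only; not the SENSUS tooling.
from collections import Counter
import re

text = """The liver sample was hybridised to a DNA microarray.
Gene expression in the liver sample was profiled after treatment;
gene expression changes suggest a xenobiotic response."""

words = re.findall(r"[a-z]+", text.lower())
stop = {"the", "a", "to", "was", "in", "after", "and", "of"}
bigrams = [
    f"{w1} {w2}"
    for w1, w2 in zip(words, words[1:])
    if w1 not in stop and w2 not in stop
]

for term, n in Counter(bigrams).most_common(3):
    print(n, term)  # "liver sample" and "gene expression" surface as candidates
```

This is also, in miniature, the kind of terminology extraction that Chapter IV combines with concept maps to seed elicitation sessions with domain experts.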
1.2.6 DILIGENT
DILIGENT (DIstributed, Loosely-controlled and evolvInG Engineering of oNTologies) is
one of the few methodologies engineered specifically for the Semantic Web (SW). The SW is a vision in which the current, largely human-accessible Web is annotated using ontologies such that the vast content of the Web is available for machine processing [40]: "... an extension of the current Web in which information is given well-defined meaning, better enabling computers and people to work in cooperation. It is the idea of having data on the Web defined and linked in a way that it can be used for more effective discovery, automation, integration and reuse across various applications... data can be shared and processed by automated tools as well as by people." [20, 40, 41] The goal of the SW is: "The goal of the Semantic Web initiative is as broad as that of the Web: to create a universal medium for the exchange of data. It is envisaged to smoothly interconnect personal information management, enterprise application integration, and the global sharing of commercial, scientific and cultural data. Facilities to put machine-understandable data on the Web are quickly becoming a high priority for many organizations, individuals and communities" [41]. DILIGENT was conceived as a methodology for developing ontologies on a community basis. Although the DILIGENT approach assumes the active engagement of the community of practice throughout the entire process, it does not give extensive details. Some particularities may be found reported for those cases in which DILIGENT has been used; see for instance [42].
C1. DILIGENT is influenced by knowledge engineering, as this methodology has been developed assuming the ontologies will be used by knowledge-based systems. However, DILIGENT introduces novel concepts such as the importance of the evolution of the ontology and the participation of communities within the development and life cycle of the ontology.
C2. DILIGENT provides some details, specifically for those developments in which it has been used.
C3. DILIGENT is application-dependent. There is no indication about the kind of domain experts they advise working with.
C4. The selection between top-down, bottom-up or middle-out is problem dependent. No indication is given as to which strategy would be best to follow.
C5. DILIGENT assumes an iterative life cycle in which the ontology is in constant evolution.
C6. In principle DILIGENT does not recommend methods or techniques. By the same token, DILIGENT is not linked to any software supporting either the development or the collaboration.
C7. Some cases for which DILIGENT has been used have been reported; see for instance [42].
C8. The involvement of communities is considered in this methodology.
C9. Although knowledge elicitation is considered in this methodology, no special emphasis is placed on it.
1.3 WHERE IS THE MELTING POINT?
The considerable number of methodologies and the little detail provided by each of them make it difficult to find a melting point. Some similarities and shortcomings are analyzed in this section. A summary of the comparison is given in Table 1.
Methodology | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9
Uschold and King | Partial | Very little | AI | MOut | N/A | N/A | N/A | N/A | N/A
Gruninger and Fox | Small | Little | ASD | MOut | TBD | N/A | N/A | N/A | N/A
Bernaras | A lot | Very little | AD | TD | N/A | N/A | N/A | N/A | N/A
Fernandez | A lot | A lot | AI | MOut | EP | N/A | Multiple developments reported | N/A | Partially
Swartout | Inexistent | Medium | ASD | N/A | TBD | Some activities missing; technology recommended | N/A | N/A | N/A
Pinto | Small | Small | AI | DED/TED | TBD | N/A | Business and foundational ontologies | Supported | Partially
Garcia | Small | A lot | ASD | DED/TED | EP | Some techniques and methods recommended; technology recommended | Developments reported for the Bio domain | Supported | Supported

C1 = Inheritance from knowledge engineering; C2 = Detail of the methodology; C3 = Strategy for building the ontology; C4 = Strategy for identifying concepts; C5 = Recommended life cycle; C6 = Recommended methods, techniques and technology; C7 = Applicability; C8 = Community involvement; C9 = Knowledge elicitation.
Application-independent = AI; Application-semidependent = ASD; Application-dependent = AD; Top-down = TD; Bottom-up = BU; Middle-out = MOut; Domain Expert Dependent = DED; Terminology Extraction Dependent = TED; N/A = not available; TBD = to be detailed; EP = evolving prototypes.
Chapter 1 - Table 1. Summary of methodologies. Reproduced and extended with permission from Fernandez, M. [1]
1.3.1 Similarities between methodologies
Although the investigated methodologies are different from each other, it was possible
to identify some commonalities amongst them. Figure 4 illustrates those shared stages across all investigated methodologies except DILIGENT.
Chapter 1 - Figure 4. Similarities amongst methodologies.
DILIGENT presents some fundamental differences, as it was engineered as a methodology for developing ontologies within geographically non-centralised settings. The identified differences are listed below:
• Life cycle: Within the DILIGENT methodology the ontology is constantly evolving, in a never-ending cycle. The life cycle of the ontology is understood as an open cycle in which the ontology evolves in a dynamic manner.
• Collaboration: Within the DILIGENT methodology a group of people agrees on the formal specification of the concepts, relations, attributes, and axioms that the ontology should provide. This approach empowers domain experts in a way that sets DILIGENT apart from the other methodologies.
• Knowledge elicitation: Due in part to the involvement of the community, and in part to the importance of an agreement within the DILIGENT methodology, knowledge elicitation is assigned a high level of importance as it supports the process by which consensus is reached.
1.3.2 Shortcomings of the methodologies
From the analysis previously presented it is clear that no single methodology brings
together everything that is needed when developing ontologies; methodologies have been
developed on an ad hoc basis. Some of the methodologies, such as that of Bernaras, provide information about the importance of the relationship between the final application using the ontology and the process by which the ontology is engineered. This consideration is not always taken into account from the beginning of the development; clearly the kind of ontology that is being developed heavily influences this relationship. For instance, foundational ontologies rarely consider the software using the ontology as an important issue; these ontologies focus more on fundamental issues affecting the classification system such as time, space, and events. They tend to study the intrinsic nature of entities independently from the particular domain in which the ontology is going to be used [20]. The final application in which the ontology will be used also influences the kind of domain experts that should be considered for the development of the ontologies. For instance, specialised domain experts are necessary when developing application ontologies, domain ontologies or task ontologies, but they tend not to have such a predominant role when building foundational ontologies. For these kinds of ontologies philosophers and broader-knowledge experts are usually more suitable. None of the investigated methodologies provided real detail; the descriptions of the processes were scarce and, where present, theoretical. No account was given of the ontology-building sessions. The methods employed during the development of the ontologies were not fully described. For instance, the reasons for choosing a particular method over a similar one were not presented. Similarly, there was no indication as to what software should be used to develop the ontologies. METHONTOLOGY was a particular case for which there is a software environment associated with the methodology; the recommended software, WebODE [32], was developed by the same group to be used within the framework proposed by their methodology. Although the investigated methodologies have different views on the life cycle of the ontology, none of them, except for DILIGENT, considers the life cycle to be dynamic. This is reflected in the processes these methodologies propose. The development happens in a continuum; some parts within the methodologies are iterative processes, but the steps are
linear, taking place one after the other. In the case of DILIGENT the different view on the life cycle is clear. However, there is no clear understanding as to how this life cycle is dynamic and evolving; the authors do not present any such discussion. The lack of support for the continued involvement of domain experts scattered around the world is a shortcoming in the investigated methodologies. As the SW poses a scenario in which information is highly decentralised, such a consideration is important. The biological sciences pose a similar scenario, in which domain experts are geographically distributed and the interaction takes place mostly on a virtual basis. Ontologies in the semantic web should not only be domain and/or task specific but also application oriented. Within the SW the construction of applications and ontologies will not always take place as part of the same software development projects. It is therefore important for these ontologies to be easily extensible; their life cycle is one in which the ontologies are in constant evolution, highly dynamic and highly reusable. Ontologies in biology have always supported a wide range of applications; MO, for instance, is used by several unrelated microarray laboratory information systems around the world. In both scenarios, SW and biology, not only is the structure of the ontology constantly evolving, but also the role of the knowledge engineer is not that of a leader but more that of a facilitator of collaboration and communication among domain experts. Parallels can be drawn between the biological domain and the SW. Pinto and co-workers [16] define SW-related scenarios as distributed, loosely controlled and evolving. As has been pointed out by Garcia et al. [7], domain experts in biological sciences are rarely in one place; they tend to form virtual organisations where experts with different but complementary skills collaborate in building an ontology for a specific purpose. The structure of the collaboration does not necessarily incorporate a central control, and different domain experts join and leave the network at any time and decide on the scope of their contribution to the joint effort. Biological ontologies are constantly evolving, not only as new instances are added, but also as new whole/part-of properties are identified as new uses of the ontology are investigated. The
rapid evolution of biological ontologies is due in part to the fact that ontology builders are also those who will ultimately use the ontology [43]. Pinto and co-workers [16], as well as Garcia et al. [7], have summarised the differences between classic proposals for building ontologies and those requirements added by the SW in four key points:
• Distributed information processing with ontologies: Within the SW scenario, ontologies are developed by geographically distributed domain experts willing to collaborate, whereas KE deals with centrally developed ontologies.
• Domain expert-centric design: Within the SW scenario, domain experts guide the effort while the knowledge engineer assists them. There is a clear and dynamic separation between the domain of knowledge and the operational domain. In contrast, traditional KE approaches relegate the role of the expert to that of an informant to the knowledge engineer.
• Ontologies are in constant evolution in the SW, whereas in KE scenarios ontologies are simply developed and deployed.
• Additionally, within the SW scenario, fine-grained guidance should be provided by the knowledge engineer to the domain experts.
The lack of unified criteria makes it difficult to amalgamate methodologies; each group applies its own methodology, adapting it to the specific problem it is addressing. Unfortunately, due to the lack of detail on the methods and techniques used in the investigated methodologies, a unification of criteria is difficult. Collaboration is considered only by DILIGENT; however, this methodology does not propose methods for engaging the collaborators. Moreover, knowledge elicitation, whether within the context of collaboration or as a focus group activity, is not addressed. METHONTOLOGY considers knowledge elicitation as part of the methodology, but there are no recommendations regarding knowledge elicitation methods. Collaboration, knowledge elicitation, a better understanding of the ontology life cycle, and more detail for the different steps involved are important pieces of information that should be described so that methodologies may be better replicated. There is also an increasing need for more reuse of methodologies rather than the development of ad hoc de novo methodologies. These are precisely the issues that the methodology proposed in this thesis will address. Throughout
chapters three, four, and five the methodology as well as its corresponding methods and illustrative cases will be presented. It is based on real cases worked out within the biological domain as well as on a thoughtful analysis of previously proposed methodologies. In this chapter a comparison and analysis of existing methodologies has been presented. The framework that Fernandez [1] proposed has been extended and other dimensions have been added to the comparison. The corresponding summary is presented in Table 1.
1.4 ACKNOWLEDGEMENTS
The author especially thanks Oscar Corcho and Mariano Fernandez for their extremely helpful suggestions.
1.5 REFERENCES
1. Fernandez M: Overview of Methodologies for Building Ontologies. In: Proceedings of the IJCAI-99 Workshop on Ontologies and Problem-Solving Methods (KRR5): 1999; Stockholm, Sweden; 1999.
2. Fernandez M, Gomez-Perez A, Juristo N: METHONTOLOGY: From Ontological Art to Ontological Engineering. In: Workshop on Ontological Engineering, AAAI-97 Spring Symposium Series: 1997; Stanford; 1997.
3. Mirzaee V: An Ontological Approach to Representing Historical Knowledge. MSc Thesis. Vancouver: Department of Electrical and Computer Engineering, University of British Columbia; 2004.
4. Beck H, Pinto HS: Overview of Approach, Methodologies, Standards, and Tools for Ontologies. The Agricultural Ontology Service (UN FAO) 2003.
5. Fernandez-Lopez M, Gomez-Perez A: Overview and Analysis of Methodologies for Building Ontologies. Knowledge Engineering Review 2002, 17(2):129-156.
6. Noy NF, Hafner CD: The state of the art in ontology design - A survey and comparative review. AI Magazine 1997, 18(3):53-74.
7. Garcia CA, Rocca-Serra P, Stevens R, Taylor C, Nashar K, Ragan MA, Sansone S: The use of concept maps during knowledge elicitation in ontology development processes - the nutrigenomics use case. BMC Bioinformatics 2006, 7:267.
8. Ashburner M, Ball C, Blake J, Botstein D, Butler H, Cherry J, Davis A, Dolinski K, Dwight S, Eppig J et al: Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genetics 2000, 25(1):25-29.
9. Crosby MA, Goodman JL, Strelets VB, Zhang P, Gelbart WM, FlyBase Consortium: FlyBase: genomes by the dozen. Nucleic Acids Res 2007, 35:486-491.
10. Cherry JM, Ball C, Weng S, Juvik G, Schmidt R, Adler C, Dunn B, Dwight S, Riles L, Mortimer RK et al: Genetic and physical maps of Saccharomyces cerevisiae. Nature 1997, 387(6632 Suppl):67-73.
11. Eppig JT, Bult CJ, Kadin JA, Richardson JE, Blake JA, Mouse Genome Database Group: The Mouse Genome Database (MGD): from genes to mice - a community resource for mouse biology. Nucleic Acids Res 2005, 33:471-475.
12. Jaiswal P, Avraham S, Ilic K, Kellogg EA, McCouch S, Pujar A, Reiser L, Rhee SY, Sachs MM, Schaeffer M et al: Plant Ontology (PO): a controlled vocabulary of plant structures and growth stages. Comparative and Functional Genomics 2005, 6(7-8):388-397.
13. Fernandez-Lopez M, Gomez-Perez A, Pazos-Sierra J, Pazos-Sierra A: Building a Chemical Ontology Using Methontology and the Ontology Design Environment. IEEE Intelligent Systems & Their Applications 1999, 14(1):37-46.
14. IEEE: IEEE Standard for Software Quality Assurance Plans. IEEE Std 730-1998: IEEE Computer Society; 1998.
15. IEEE: IEEE Standard Glossary of Software Engineering Terminology. IEEE Std 610.12-1990: IEEE; 1991.
16. Pinto HS, Staab S, Tempich C: DILIGENT: towards a fine-grained methodology for DIstributed, Loosely-controlled and evolvInG engineering of oNTologies. In: European Conference on Artificial Intelligence: 2004; Valencia, Spain; 2004: 393-397.
17. Negnevitsky M: Artificial Intelligence: A Guide to Intelligent Systems, 2nd edn: Addison Wesley; 2004.
18. Kendal S, Creen M: An Introduction to Knowledge Engineering. London: Springer-Verlag; 2006.
19. Corcho O, Fernandez-Lopez M, Gomez-Perez A: Methodologies, tools, and languages for building ontologies. Where is their meeting point? Data and Knowledge Engineering 2003, 46(1):41-64.
20. Sure Y: Methodology, Tools & Case Studies for Ontology based Knowledge Management. Karlsruhe: Universitat Fridericiana zu Karlsruhe; 2003.
21. Gangemi A, Guarino N, Masolo C, Oltramari A, Schneider L: Sweetening ontologies with DOLCE. In: Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management: Ontologies and the Semantic Web: 2002: Springer-Verlag; 2002: 166-181.
22. Uschold M, Gruninger M: Ontologies: Principles, methods and applications. Knowledge Engineering Review 1996, 11(2):93-136.
23. Lakoff G: Women, fire, and dangerous things: what categories reveal about the mind. Chicago: Chicago University Press; 1987.
24. Cooke N: Varieties of Knowledge Elicitation Techniques. International Journal of Human-Computer Studies 1994, 41:801-849.
25. Uschold M, King M: Towards a Methodology for Building Ontologies. In: Workshop on Basic Ontological Issues in Knowledge Sharing, held in conjunction with IJCAI-95: 1995; Cambridge, UK; 1995.
26. Gruninger M, Fox MS: The Role of Competency Questions in Enterprise Engineering. In: Proceedings of the IFIP WG5.7 Workshop on Benchmarking - Theory and Practice: 1994; Trondheim, Norway; 1994.
27. Bernaras A, Laresgoiti I, Corera J: Building and Reusing Ontologies for Electrical Network Applications. In: Proceedings of the European Conference on Artificial Intelligence (ECAI96): Budapest; 1996.
28. Swartout B, Patil R, Knight K, Russ T: Toward Distributed Use of Large-Scale Ontologies. In: Symposium on Ontological Engineering of AAAI: 1997; Stanford, California; 1997.
29. Vrandecic D, Pinto HS, Sure Y, Tempich C: The DILIGENT Knowledge Processes. Journal of Knowledge Management 2005, 9(5):85-96.
30. Uschold M, King M, Moralee S, Zorgios Y: The Enterprise Ontology. The Knowledge Engineering Review 1998, 13 (Special issue on Putting Ontologies to Use).
31. Fernandez-Lopez M, Gomez-Perez A: Overview and Analysis of Methodologies for Building Ontologies. The Knowledge Engineering Review 2002, 17(2):129-156.
32. Arpirez JC, Corcho O, Fernandez-Lopez M, Gomez-Perez A: WebODE in a nutshell. AI Magazine 2003, 24(3):37-47.
33. Arpirez JC, Gomez-Perez A, Lozano A, Pinto HS: Reference Ontology and ONTO2 Agent: The Ontology Yellow Pages. In: Workshop on Applications of Ontologies and Problem-Solving Methods, European Conference on Artificial Intelligence (ECAI98): 1998; Brighton, UK; 1998.
34. Fellbaum C: WordNet: An Electronic Lexical Database. Boston: The MIT Press; 2000.
35. ISI: Information Sciences Institute. SENSUS Ontology, http://www.isi.edu/natural-language/projects/ONTOLOGIES.html. 2007.
36. Knight K, Luk S: Building a large knowledge base for machine translation. In: Proceedings of the American Association of Artificial Intelligence: 1994: 773-778.
37. Knight K, Chander I: Automated Postediting of Documents. In: Proceedings of the National Conference on Artificial Intelligence (AAAI): 1994.
38. Knight K, Graehl J: Machine Transliteration. In: Proceedings of the Conference of the Association for Computational Linguistics (ACL): 1997.
39. Valente A, Russ T, MacGregor R, Swartout B: Building and (Re)Using an Ontology of Air Campaign Planning. IEEE Intelligent Systems & Their Applications 1999 (January/February).
40. Berners-Lee T: Weaving the Web: HarperCollins; 1999.
41. W3C: Semantic Web Activity Statement. http://www.w3.org/2001/sw/Activity. 2007.
42. Pinto S, Staab S, Sure Y, Tempich C: OntoEdit Empowering SWAP: a Case Study in Supporting DIstributed, Loosely-Controlled and evolvInG Engineering of oNTologies (DILIGENT). In: ESWS 2004: 2004: 16-30.
43. Bada M, Stevens R, Goble C, Gil Y, Ashburner M, Blake J, Cherry J, Harris M, Lewis S: A short study on the success of the Gene Ontology. Journal of Web Semantics 2004, 1:235-240.
The melting point, a methodology for developing ontologies within decentralised settings
This chapter addresses two research questions: "How should a well-engineered methodology facilitate the development of ontologies within communities of practice, and what methodology should be used?" and "If ontologies are to be developed by communities, how should the ontology development life cycle be understood within this context?" This chapter presents the proposed methodology, describes each step, highlights its novel components, and compares the methodology with alternatives. The methodology presented here is the product of experiences gathered from the scenarios reported in chapters 3, 4, and 5. Not only is this methodology based upon real cases but also, and more importantly, the steps, methods and techniques described here have been extensively tested. This is the first methodology engineered for decentralised communities of practice in which the designers of technology and its users may be the same group. The use of concept maps throughout the development process, the importance of argumentative structures, and the usefulness of narratives and text-mining techniques are among the methods and techniques described here. Subsequent chapters present the experiences that allowed the author not only to test and extend the methodology but also to validate it.
The author engineered the methodology and defined the steps, methods, and techniques involved. The investigation that allowed gathering all the information and data supporting this methodology was conducted entirely by Alex Garcia; the involvement of communities of practice, as well as the identification of those areas in which there could be interest in helping the author with his research, were also activities carried out by Alex Garcia. The manuscripts, as well as the corresponding journal and conference publications, were written by Alex Garcia.
2 Chapter II - The melting point, a methodology for developing ontologies within decentralised settings
“…it is extremely difficult to judge the value of a methodology in an objective way. Experimentation is of course the proper way to do it, but it is hardly feasible because there are too many conditions that cannot be controlled… Introducing a toy problem will violate the basic assumption behind the need for a methodology: a complex development process.”
De Hoog R. 1988
2.1 INTRODUCTION
As presented in the previous chapter, building ontologies has been more of an ad hoc process than a well-engineered practice. It has been argued by several authors that to this day there is no agreed-upon standard methodology for building ontologies [1-3]. Nonetheless, there exist generic components fundamental to the ontology-building process, present in most or all ontology development efforts even if they are not explicitly identified. A detailed study of methodologies and those generic components was presented in Chapter 1. In the present chapter, "The melting point, a methodology for developing ontologies within decentralised settings", those generic components are orchestrated in a manner coherent not only with the way communities build ontologies but also with the life cycle of these ontologies. The description of features and interrelationships is based upon experimentation and observation that took place during developments in real scenarios. It was possible not only to have direct access to domain experts but also to monitor the evolution and intended use of the ontology; moreover, it was possible to study the processes by which the community became involved in the development of the ontology. Previously proposed methodologies have been engineered for centralised settings, in which the ontology is developed and deployed on a one-off basis. The maintenance, as well as the evolution, of the ontology is left to the knowledge engineer and a
reduced group of domain experts. The same is true throughout the whole process: a reduced group of domain experts works together with the knowledge engineer during the development of the ontology, and the community is not widely involved. Within the Semantic Web (SW), as well as within the biological domain, the involvement of communities of practice is crucial not only for the development, but also for the maintenance and evolution of ontologies. Domain experts in biological sciences are rarely in one place; they tend to form virtual organisations in which experts with different but complementary skills collaborate in building an ontology for a specific purpose. The structure of the collaboration does not necessarily have a central control; different domain experts join and leave the network at any time and decide on the scope of their contribution to the joint effort. Biological ontologies are constantly evolving; new classes, properties, and instances may be added at any time, and new uses for the ontology may be identified [2]. The rapid evolution of biological ontologies is due in part to the fact that ontology builders are also those who will ultimately use the ontology [4].
This chapter presents the methodology inferred from those scenarios in which it was possible to conduct experiments that allowed the author to understand the importance and impact of the community, as well as the structure and orchestration of the fundamental components of the ontology-building process. The initial section of this chapter is a brief introduction stressing the important points that will be elaborated throughout the chapter; some terminological considerations are presented in the second section. This is followed by the presentation of the methodology and related information; methods, techniques, activities and tasks are presented in section three. Section four presents the incremental evolutionary spiral model of tasks, activities and processes consistent with the life cycle. Sections five and six present the discussion and conclusions.
2.2 TERMINOLOGICAL CONSIDERATIONS
Some of the common points across previously proposed methodologies have been adapted for the present work. An important contribution to this methodology comes from observations made by Gomez-Perez et al. [5, 6], Fernandez et al. [7], Pinto et al. [8, 9], and Garcia et al. [2] (see chapters 3, 4, and 5 for more information on Garcia's observations). Both Fernandez et al. and Gomez-Perez et al. emphasise the importance of complying with Institute of Electrical and Electronics Engineers (IEEE) standards, more specifically with the "IEEE standard for software quality assurance plans" [10]. In the context of the conclusions drawn in the previous chapter, such concern is understandable; not only does standards compliance ensure careful and systematic planning of the development, but it also ensures the applicability of the methodology to a broad range of problems. It also became clear from the previous chapter that methodologies bring together techniques and methods in an orchestrated way so that the work can be done. A method is "an orderly process or procedure used in the engineering of a product or performing a service" [11]. A technique is a "technical and managerial procedure used to achieve a given objective" [10]. Figure 1 illustrates these relationships in a more comprehensive manner.
Chapter 2 - Figure 1. Terminological relationships.
Greenwood [12], as well as Gomez-Perez et al. [13], present these terminological relationships in a simple way: "a method is a general procedure while a technique is the specific application
of a method and the way in which the method is executed" [13]. According to the IEEE [14], a process is a "function that must be performed in the software life cycle. A process is composed of activities". The same set of standards defines an activity as "a constituent task of a process" [14]. A task is the atomic unit of work that may be monitored, evaluated and/or measured; a task is "a well defined work assignment for one or more project members. Related tasks are usually grouped to form activities" [14].
2.3 THE METHODOLOGY AND THE LIFE CYCLE
For the purposes of the proposed methodology, the work involved is framed within processes and activities, as illustrated in Figure 1; this conception is promoted by METHONTOLOGY [7] for centralised settings. As these activities were not conceived for decentralised settings, their scope has been redefined so that they better fit the life cycle of ontologies developed by communities. The methodology presented here emphasises decentralised settings and community involvement. It also stresses the importance of the life cycle these ontologies follow, and provides activities, methods and techniques coherently embedded within this life cycle. The methodology and the life cycle are illustrated in Figure 2. The overall process starts with the documentation and management processes; the development process immediately follows. Managerial activities happen throughout the whole life cycle, as the interaction amongst domain experts ensures not only the quality of the ontology but also that the predefined control activities take place. The development process has four main activities: specification, conceptualisation, formalisation and implementation, and evaluation. Different prototypes of the ontology are thus constantly being deployed. Initially these prototypes may be unstable, as the classes and properties may change drastically. In spite of this, the process evolves rapidly, achieving a stability that facilitates the use of the ontology; changes become more focused on the inclusion of classes and instances, rather than on the redefinition of the class hierarchy.
Chapter 2 - Figure 2. Life cycle, processes, activities, and view of the methodology.
2.3.1 Documentation processes
Documentation is a continuous process throughout the entire development of the ontology. This documentation should make it possible for new communities of practice to get involved in the development of the ontology.
2.3.1.1 Activities for documenting the management processes
• Scheduling: Gantt charts are useful when scheduling processes; simple spreadsheets or Word documents may also be used.
• Control: flowcharts allow for a simple view of the process and of those points at which a control activity is needed.
Although there are several software suites that assist in project management, some of them offering workgroup capabilities (see for instance http://www.mindtools.com/), large
biological ontology projects use simpler solutions, such as the facilities Google offers for networking. Scheduling and control activities can be carried out using Google Calendar (http://www.google.com/calendar); by the same token, sharing documents is facilitated by Google Documents (http://docs.google.com). For establishing communication and exchanging information, email, wiki pages and voice over Internet Protocol (IP) systems have proven useful in projects such as the Ontology for Biomedical Investigations (OBI) [15] and the Microarray Ontology (MO) [16, 17]. A more detailed description of the involvement of communities by such means was published by Garcia et al. [2]. For both scheduling and controlling, the software tool(s) should in principle:
• help to plan the activities and tasks that need to be completed,
• give a basis for scheduling when these tasks will be carried out,
• facilitate planning the allocation of resources needed to complete the project,
• help to work out the critical path for a project that must be completed by a particular date,
• facilitate the interaction amongst participants, and
• provide participants with simple means for exchanging information.
2.3.1.2 Documenting classes and properties
Although documentation happens naturally, as discussions often take place by email, it is often difficult to follow the argumentative thread. Even so, the information contained in mailing lists is useful and should, whenever possible, be related to classes and properties. Use cases, in the form of examples in which the use of a term is well illustrated, should also be part of the documentation of classes and properties. Ontology editors allow domain experts to comment on the ontology; this kind of documentation is useful, as it reflects the understanding of the domain expert. For classes and properties there are three main sources of documentation:
• Mailing lists: discussions about why a class should be part of the ontology, why it should be part of a particular branch, how it is being used by the community, how a property relates two classes, and in general all discussions relevant to the ontology happen by email.
• On-the-ontology comments: in cases where domain experts are familiar with the ontology editor, they usually comment on classes and properties directly; a sketch of such annotations is given below.
• Use cases: these should be the main source of structured documentation provided by domain experts. However, gathering use cases is often difficult and time-consuming. The use cases should illustrate how a term is being used in a particular context, how the term is related to other terms, and the different uses or meanings a term may have. Guidance is available for the construction of use cases when developing software; however, such direction is not available when building ontologies. From the experiences in which the author participated, some general guidance can be drawn; for instance:
o use cases should be brief,
o they should be based upon real-life examples,
o knowledge engineers have to be familiar with the terminology as well as with the domain of knowledge, because use cases are usually provided in the form of narratives describing processes,
o graphical illustrations should be part of the use case, and
o whenever possible, concept maps or other related KA artefacts should be used.
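As an illustration of such on-the-ontology comments, the sketch below attaches annotations to a class using the owlready2 Python library; the ontology IRI, class name and mailing-list URL are hypothetical, and any OWL editor (e.g. Protégé) exposes the same standard annotation properties.

from owlready2 import Thing, get_ontology

onto = get_ontology("http://example.org/doc-demo.owl")

with onto:
    class Probe(Thing): pass

# rdfs:comment records the domain expert's understanding of the class;
# rdfs:isDefinedBy can point back to the mailing-list thread or use case
# in which the term was discussed.
Probe.comment.append("A labelled nucleic-acid fragment used to detect "
                     "complementary sequences; see use case UC-12.")
Probe.isDefinedBy.append("http://example.org/mailing-list/thread-42")
onto.save(file="doc-demo.owl")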
2.3.2 Management processes
Management processes start as soon as the decision to develop the ontology is made, and they continue throughout the whole ontology development process. Managerial processes aim to assure the successful development of the ontology by providing domain experts with all that is needed. Managerial processes also define the general policies that allow the orchestration of the whole development. Some of the activities involved in the managerial processes are:
2.3.2.1 Scheduling
Scheduling identifies the tasks, time and resources needed.
2.3.2.2 Control
Control ensures that the planned tasks are completed.
2.3.2.3 Inbound interaction
Inbound interaction specifies how the interaction amongst domain experts will take place, for instance by phone calls, mailing lists, wiki pages and web publications.
2.3.2.4 Outbound interaction
As different communities should in principle be allowed to participate, there has to be an inclusion policy that specifies how a new community can collaborate and engage with the ongoing development.
2.3.2.5 Quality assurance
This activity defines minimal standards for the outputs of each and every process, activity or task carried out within the development of the ontology. For both inbound and outbound interactions, there are some key questions that should be addressed:
• Which communities of practice are involved in this development?
• Are there going to be branches for the intended ontology?
• What is the relationship between the communities of practice and the branches of the ontology?
• Are there going to be editors for the different branches of the ontology?
2.3.3 Development-oriented processes
2.3.3.1 Feasibility study and milestones
Feasibility study: This first activity involves addressing straightforward questions such as: What is the ontology going to be used for? How is the ontology ultimately going to be used by the software implementation? What do we want the ontology to be aware of, and what is the scope of the knowledge we want to have in the ontology? The milestones for this activity are: competency questions, scenarios in which it is foreseeable that the ontology will be used, and a decision on whether there is a "go" for the ontology.
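Competency questions can later be made operational as queries over prototype versions of the ontology; the sketch below illustrates this common practice (not a step prescribed by the methodology itself) using the rdflib Python library, with a hypothetical file name, namespace and class names.

from rdflib import Graph

# Load a prototype of the ontology (file name is hypothetical).
g = Graph()
g.parse("prototype-ontology.owl", format="xml")

# A competency question such as "Which probes hybridise to gene X?"
# rewritten as a SPARQL query; if it can be answered, the ontology
# covers this competency question.
COMPETENCY_QUESTION = """
PREFIX ex: <http://example.org/bio#>
SELECT ?probe WHERE {
    ?probe a ex:Probe ;
           ex:hybridisesTo ex:GeneX .
}
"""

for row in g.query(COMPETENCY_QUESTION):
    print(row.probe)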
2.3.3.2 Activities for the conceptualisation
• Domain Analysis (DA) and Knowledge Acquisition (KA): Knowledge elicitation (KE) is the process of collecting, from a human source of knowledge, information that is relevant to that knowledge [18]. Knowledge acquisition includes the
elicitation, collection, analysis, modelling and validation of knowledge for knowledge engineering and knowledge management projects. The notion of both KA and KE comes from the development of knowledge bases; for the purposes of developing ontologies, KA and KE can be considered interchangeable terms. Domain analysis is the process by which a domain of knowledge is analysed in order to find the common and variable components that best describe that domain. KA and DA are interchangeable and complementary activities by which the information used in a particular domain is identified, captured and organised for the purpose of making it available in an ontology [19]. Both DA and KA are also part of the formalisation and implementation activities, the difference being the maturity level of the expected outcomes as well as the activities involved. When analysing and acquiring knowledge, activities and tasks are oriented more towards producing a baseline ontology, for instance the identification of recyclable and reusable ontologies, as well as of the basic terminology that describes the domain. Identifying available sources of knowledge is also important; by doing so it is possible to better scope the ontology. More detailed information about reusable and recyclable ontologies may be found in chapter 2, section 3.1.2.2.2, and also in [2]. Reusing ontologies is not always instantly possible; however, it is important to identify how to extend and adapt existing ontologies so that collaboration with other groups developing ontologies becomes more fruitful. Baseline ontologies tend to lack formal definitions, whole/part-of relationships, and a stable is-a structure; the activities related to DA and KA focus more on capturing and representing knowledge in an immediate manner, not necessarily on having logical expressions as part of the models, whereas when formalising and evaluating an ontology, activities and tasks are oriented more towards including logical constraints and expressions. DA and KA may be seen as the 'art of questioning', since ultimately all relevant knowledge is either directly or indirectly in the heads of domain experts. This activity involves the definition of the terminology, i.e. the linguistic phase. It starts with the identification of reusable ontologies and terminates with the baseline ontology, i.e. a draft version containing few but seminal elements of an ontology. The following criteria are important during knowledge acquisition [2]:
o Accuracy in the definition of terms: the linguistic part of the ontology development is also meant to support the sharing of information and knowledge. The availability of context as part of a definition is useful when sharing knowledge.
o Coherence: as concept maps (CMs) are enriched, it is important to ensure the coherence of the story being captured. Domain experts are asked to use the CMs as a means to tell a story; consistency within the narration is therefore crucial.
o Extensibility: this approach may be seen as an aggregation problem; CMs constantly gain information, which is always part of a bigger narration. Extending the conceptual model is not only about adding more detail to the existing CMs, nor is it just about generating new CMs; it is also about grouping
concepts into higher-level abstractions and validating these with domain experts. Scaling the models involves the participation of both domain experts and the knowledge engineer. It is mostly done by direct interview and confrontation with the models from different perspectives. The participation of new, "fresh" domain experts, as well as the intervention of experts from allied domains, allows the models to be analysed from different angles. This participatory process allows the models to be refactored by increasing the level of abstraction. Throughout these activities, Gruber's design principles [20], such as those mentioned below, have to be considered.
• First design principle: "The conceptualization should be specified at the knowledge level without depending on a particular symbol-level encoding."
• Second design principle: "Since ontological commitment is based on the consistent use of the vocabulary, ontological commitment can be minimised by specifying the weakest theory and defining only those terms that are essential to the communication of knowledge consistent with the theory."
• Third design principle: "An ontology should communicate effectively the intended meaning of defined terms. Definitions should be objective. Definitions can be stated on formal axioms, and a complete definition (defined by necessary and sufficient conditions) is preferred over a partial definition. All definitions should be documented with natural language."
For the purpose of DA and KA it is critical to elicit and represent knowledge from domain experts. They do not, however, have to be aware of knowledge representation languages; this makes it important that the elicited knowledge is represented in a language-independent manner. Researchers participating in knowledge elicitation sessions are not always aware of the importance of the session; they are, however, aware of their own operational knowledge. This is consistent with the first of Gruber's design principles. Regardless of the syntactic format in which the information is encoded, domain experts have to communicate and exchange information. For this reason, broad general theories, principles, and broad-scope problem specifications are usually more useful when engaging domain experts in discussions, as these tend to contain only essential basic terms, known across the community and causing the minimal number of discrepancies (see the second design principle). As the community engages in the development process and the ontology grows, it becomes more important to have definitions that are usable by both computer systems and humans (see the third design principle).
2.3.3.2.1.1 MILESTONES, TECHNIQUES AND TASKS FOR THE KA AND DA ACTIVITIES
The milestones, techniques and tasks identified for the DA and KA related activities are:
• Tasks: focal groups, limited-information and constrained-processing tasks, protocol analysis, direct one-to-one interviews, terminology extraction, and inspection of existing ontologies.
• Techniques: concept mapping, sorting techniques, automatic or semi-automatic terminology extraction (a toy sketch is given after this list), informal modelling, and the Ontology Lookup Service (OLS)1 [21].
• Milestones: baseline ontology, knowledge sources, basic terminology, reusable ontologies.
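As an illustration of automatic or semi-automatic terminology extraction, the toy Python sketch below counts unigram and bigram candidate terms in a corpus of domain documents; real tools such as Text2ONTO rely on part-of-speech tagging and statistical measures (e.g. TF-IDF or C-value), so this is only a simplified stand-in.

import re
from collections import Counter

def candidate_terms(text, stopwords=frozenset({"the", "of", "and", "in", "a", "is"})):
    # Naive candidate-term extraction: lowercase the text, drop stopwords,
    # then count single words and adjacent word pairs.
    words = [w for w in re.findall(r"[a-z][a-z-]+", text.lower())
             if w not in stopwords]
    counts = Counter(words)
    counts.update(" ".join(pair) for pair in zip(words, words[1:]))
    return counts

corpus = ("Microarray experiments measure gene expression. "
          "Gene expression values are normalised before analysis.")
for term, frequency in candidate_terms(corpus).most_common(5):
    print(term, frequency)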
2.3.3.3 Iterative Building of Ontology Models (IBOM)
Iterative building of informal ontology models helps to expand the glossary of terms and relations, their definitions or meanings, and additional information such as examples to clarify the meaning where appropriate. Different models are built and validated with the domain experts. There is a fine boundary between the baseline ontology and the refined ontology; both are works in progress, but the community involved has agreed upon the refined ontology.
2.3.3.3.1 METHODS, TECHNIQUES AND MILESTONES FOR THE IBOM
The methods, techniques and milestones identified for IBOM-related activities are:
• Methods: focal groups.
• Techniques: concept mapping, informal modelling with an ontology editor.
• Milestones: refined ontology.
1 The Ontology Lookup Service (OLS) provides a user-friendly single entry point for querying publicly available ontologies in the Open Biomedical Ontology (OBO) format. By means of the OLS it is possible to verify whether an ontology term has already been defined, and in which ontology it is available.
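By way of illustration, the sketch below performs such a term lookup against the current REST interface of the EBI OLS; the endpoint and JSON field names reflect the present-day API rather than the SOAP service available when this work was carried out, so they should be treated as assumptions.

import json
import urllib.request

# Has the term "mitochondrion" already been defined, and in which ontology?
url = "https://www.ebi.ac.uk/ols/api/search?q=mitochondrion"
with urllib.request.urlopen(url) as response:
    hits = json.load(response)["response"]["docs"]

for hit in hits[:5]:
    print(hit.get("ontology_name"), hit.get("label"), hit.get("iri"))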
2.3.3.4 Formalisation
Formalisation of the ontology is the activity during which the classes are constrained and instances are attached to their corresponding classes; for example, "a male is constrained to be an animal with a y-chromosome" (a machine-readable sketch of this example is given after the guidelines below). During formalisation, domain experts and knowledge engineers work with an ontology editor. When building iterative models and formalising the ontology, the model grows in complexity; instances, classes and properties are added, and logical expressions are built in order to have definitions with necessary and sufficient conditions. For both formalisation and IBOM, Gruber's fourth design principle is applicable, as are Noy and McGuinness's guidelines [22].
• Fourth principle: "An ontology should be coherent: that is, it should sanction inferences that are consistent with the definitions. [...] If a sentence that can be inferred from the axioms contradicts a definition or example given informally, then the ontology is inconsistent."
• Noy and McGuinness's first guideline: "The ontology should not contain all the possible information about the domain: you do not need to specialise (or generalise) more than you need for your application."
• Noy and McGuinness's second guideline: "Subconcepts of a concept usually i) have additional relations that the superconcept does not have, or ii) have restrictions different from those of the superconcept, or iii) participate in different relationships than the superconcept. In other words, we introduce a new concept in the hierarchy usually only when there is something that we can say about this concept that we cannot say about the superconcept. As an exception, concepts in terminological hierarchies do not have to introduce new relations."
• Noy and McGuinness's third guideline: "If a distinction is important in the domain and we think of the objects with different values for the distinction as different kinds of objects, then we should create a new concept for the distinction."
• Noy and McGuinness's fourth guideline: "A concept to which an individual instance belongs should not change often."
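The sketch below shows what the male/y-chromosome constraint looks like as a definition with necessary and sufficient conditions, using the owlready2 Python library; the ontology IRI and class names are hypothetical, and the thesis does not prescribe this particular tool.

from owlready2 import Thing, ObjectProperty, get_ontology

onto = get_ontology("http://example.org/formalisation-demo.owl")

with onto:
    class Animal(Thing): pass
    class YChromosome(Thing): pass

    class has_chromosome(ObjectProperty):
        domain = [Animal]

    class Male(Animal):
        # equivalent_to yields a full (necessary and sufficient) definition,
        # not merely a necessary subclass axiom: a Male is an Animal that
        # has at least one Y chromosome.
        equivalent_to = [Animal & has_chromosome.some(YChromosome)]

onto.save(file="formalisation-demo.owl")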
2.3.3.5 Evaluation
There is no unified framework for evaluating ontologies, and this remains an active field of research [23]. When developing ontologies on a community basis, three main evaluation activities have been identified:
2.3.3.5.1 APPLICATION-DEPENDENT EVALUATION
Ontologies should be evaluated according to their fitness for purpose; for example, an ontology developed for annotation purposes should be evaluated by the quality of the annotation and the usability of the annotation software [2]. The community carries out this type of evaluation in an interactive manner; as the ontology is used for several purposes, constant feedback is generated. This makes it possible for the community to effectively guarantee the usability and the quality of the ontology. By the same token, the recall and precision of the data, and the usability of the conceptual query builder, should form the basis of the evaluation of an ontology designed to enable data retrieval.
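For the data-retrieval case, this evaluation reduces to standard precision and recall over a gold standard; a minimal sketch (illustrative only, with hypothetical gene identifiers) is:

def precision_recall(retrieved, relevant):
    # Precision and recall of ontology-driven retrieval against an
    # expert-curated set of relevant items (both given as sets of IDs).
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Items returned by a conceptual query vs the expert-curated truth.
print(precision_recall({"g1", "g2", "g3"}, {"g2", "g3", "g4"}))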
EVALUAT ION .
This activity was proposed by Perez-Gomez et al. [24]. The goal of the evaluation is to determine what the ontology defines, and how accurate these definitions are. Perez-Gomez et al. provides the following criteria for the evaluation: • Consistency: it is assumed that a given definition is consistent if, and only if, no contradictory knowledge may be inferred from other definitions and axioms in the ontology. • Completeness: it is assumed that ontologies are in principle incomplete [23, 24], however it should be possible to evaluate the completeness within the context in which the ontology will be used. An ontology is complete if and only if: o All that is supposed to be in the ontology is explicitly stated, or can be inferred. • Conciseness: an ontology is concise if it does not store unnecessary knowledge, and the redundancy in the set of definitions has been properly removed. 2.3. 3.5.3 T AXON OMY
EVALUAT ION
This evaluation is usually carried out by means of reasoned systems such as RACER [25] and Pellet [26]. The knowledge engineer checks for inconsistencies in the taxonomy, these may due to errors in the logical expressions that are part of the axioms.
2.3.3.6 A summary of the process
Table 1 summarises the development process: its activities (feasibility study; conceptualisation, comprising KA & DA and IBOM; formalisation; evaluation), the associated milestones (competency questions, community agreement, community evaluation, terminology evaluation, taxonomy evaluation), techniques (concept mapping, automatic or semi-automatic terminology extraction, ontology modelling with an ontology editor), tasks (defining the scope of the ontology, concept maps, informal ontology models, focal groups, direct one-to-one interviews, gathering lists of terms, refined ontology model) and suggested software (CMAP Tools, Protégé, Text2ONTO, RACER).
Chapter 2 - Table 1. A summary of the development process.
2.4 AN INCREMENTAL EVOLUTIONARY SPIRAL MODEL OF TASKS, ACTIVITIES AND PROCESSES
Ontologies, like software, evolve over time; specifications often change as development proceeds, making a straightforward path to the ontology unrealistic. Different software process models have been proposed; for instance, linear sequential models, also known as waterfall models [27, 28], are designed for straight-line development. The linear sequential model suggests a systematic, sequential approach in which the complete system is delivered once the linear sequence is completed [28]. The role of domain experts is passive, as end-users of technology; they are placed in a reacting role in order to give feedback to designers about the product. The software or knowledge engineer leads the process and controls the interaction amongst domain experts.
The prototyping model is more flexible, as prototypes are constantly being built. Prototypes are built as a means of defining requirements [28]; this allows for a more active role for domain experts. A quick design is often obtained in a short period of time. The model grows as prototypes are released [23]; engineers and domain experts work on these quick designs. They focus on representational aspects of the ontology, while the main development of the ontology (building the models, defining what is important, documenting, etc.) is left to the knowledge engineer. A high-speed adaptation of the linear sequential model is the Rapid Application Development (RAD) model [29, 30]. This emphasises short development cycles in which it is possible to add new software components as they are needed. RAD also strongly suggests reusing existing program components, or creating reusable ones [28].
The evolutionary nature of software is not considered in either of the aforementioned models. From the software engineering perspective, evolutionary models are iterative and allow engineers to develop increasingly complex versions of the software [28, 31, 32]. Ontologies are, in this sense, no different from other software components, for which process models have evolved from "linear thinking" into evolutionary process models that recognise that uncertainty dominates most projects, that timelines are often impossibly short, and that iteration provides the ability to deliver a partial but extendible solution, even when a complete product is not possible within the time allotted. Evolutionary models emphasise the need for incremental work products, risk analysis, planning followed by plan revision, and customer (domain expert) feedback [28].
KA = Knowledge Acquisition, DA = Domain Analysis, IBOM = Iterative Building of Ontology Models, F = Formalisation, EVAL = Evaluation
Chapter 2 - Figure 3. An incremental evolutionary spiral model of tasks, activities and processes.
Figure 3 illustrates how tasks and activities are incremental in the spiral and how the process is constantly evolving. Activities such as Knowledge Acquisition (KA), Domain Analysis (DA), Iterative Building of Ontology Models (IBOM), Formalisation (F), and Evaluation (EVAL) take place within the spiral, not necessarily following a strict order. Initially, the processes related to management occur. As soon as there is a "go" for the ontology development process, these activities start with KA, DA, and IBOM. Once the first prototype of the ontology has been modelled, activities, tasks and processes can coexist simultaneously at some level of detail within the spiral. The process allows for its own incremental growth by facilitating the incorporation of other activities and/or processes, such as evaluation and formalisation.
2.5 DISCUSSION
As discussed in the previous chapter, METHONTOLOGY is the only methodology that rigorously complies with IEEE standards; this facilitates the applicability and extendibility of the methodology. Other methodologies, such as those studied in chapter 1, do not intentionally meet the terms posed by the IEEE; however, some of the activities they propose may be framed within IEEE standards. Table 2 illustrates this comparison; in the same vein, Table 1 in chapter 1 places the proposed methodology within the comparison framework proposed in that chapter. The methodology proposed here reuses and adapts many components from METHONTOLOGY and other methodologies within the context of decentralised settings and participatory design. It also follows Sure's work [33], as it considers throughout the whole process the importance of the software applications that will ultimately use the ontology. The work done by Sure is complementary to the one presented in this thesis, as both works study different facets of the same process: developing knowledge-based software.
METHONTOLOGY allows for a controlled development and evolution of the ontology, placing special emphasis on quality assurance (QA) throughout the process. Although QA is considered, the authors do not propose any methods for this specific task. Management, development and support activities are carried out in a centralised manner: a limited group of domain experts interacts with the knowledge engineer to conceptualise and prototype the ontology; successive prototypes are then built, and the ontology gains more formality (e.g. logical constraints are introduced) until it is decided that the ontology may be deployed. Once the ontology has been deployed, a maintenance process takes place. Neither the development nor the evolution of the ontology involves a decentralised community; the process does not assume the constant incremental growth of the ontology that has been observed and reported in [2]. QA is also considered to be a centralised activity, contrasting with the way decentralised ontologies promote the participation of the community, in part to ensure the quality of the delivered ontology.
The comparison against the IEEE process categories distinguishes the project management process, the project development-oriented process (pre-development; development, comprising the requirements, design and implementation processes; and post-development), and the integral process. For each methodology:
• Uschold and King: project management N/A; pre-development N/A; requirements proposed; design N/A; implementation proposed; post-development N/A; integral process: activities not identified for training, environment study, and configuration management.
• Gruninger and Fox: project management N/A; pre-development N/A; requirements, design and implementation proposed; post-development N/A; integral process: activities not identified for training, environment study, and configuration management.
• Bernaras: project management N/A; pre-development N/A; requirements, design and implementation proposed; post-development N/A; integral process N/A.
• Fernandez: project management N/A; pre-development N/A; requirements, design and implementation proposed; post-development N/A; integral process: activities not identified for training, environment study, and configuration management.
• Swartout: project management N/A; pre-development N/A; requirements proposed; design N/A; implementation proposed; post-development N/A; integral process N/A.
• Pinto: project management distributed across members of the community; pre-development N/A; requirements proposed, relying on community involvement; design N/A; implementation proposed; post-development: assumes a SW post-development process in which the community maintains the ontology; integral process: some outline for configuration management is given.
• Garcia: project management distributed across members of the community; pre-development proposed, relying on community involvement; requirements proposed, relying on community involvement; design and implementation proposed; post-development: assumes a SW post-development process in which the community maintains the ontology; integral process: training activities identified; an outline for configuration management and environment study is also given.
N/A: not available.
Chapter 2 - Table 2. Methodology compliance with IEEE Std 730-1998 [10]. Reproduced and extended with permission from Fernandez M [34].
As the required ontologies grow in complexity, so does the process by which they are obtained. A quick inspection of previously proposed methodologies shows how the involvement of communities has become a predominant requirement, one not yet fully addressed by most methodologies. Methods, techniques, activities and tasks become more
group-oriented, making it necessary to re-evaluate the whole process as well as the way in which it is described. The IEEE proposes a set of concepts that should in principle facilitate the description of a methodology; however, these guidelines should be better scoped for decentralised environments. Activities within decentralised ontology developments are highly interrelated. However, the maturity of the product allows engineers and domain experts to determine boundaries, and by doing so to establish milestones for each and every activity and task. Although managerial activities are interrelated and have a high-level impact on the development processes, it is advisable not to have rigid management structures. For instance, control and inbound-outbound activities usually coexist with some development activities when a new term needs to be added. This interaction requires the orchestration of all the activities to ensure the evolution of the ontology. An illustration of this situation and a feasible course of action are presented in Figure 4.
Chapter 2 - Figure 4. Adding a term.
When communities develop ontologies, the life cycle varies. The ontology is not deployed on a one-off basis; there is thus no definitive final version of the ontology. The
involvement of the community allows for rapid evolution, as well as for very high quality standards; errors are identified and discussed, and corrections are made available within short time frames. The model upon which this proposed methodology is based brings together ideas from linear sequential modelling [28, 35], prototyping, spiral [36], incremental [37, 38] and evolutionary models [28, 39]. Due to the dynamic nature of the interaction when developing ontologies on a community basis, the model grows rapidly and continuously. As this happens, prototypes are delivered, documentation is constantly generated, and evaluation takes place at all times, as the growth of the model is driven by the argumentation amongst domain experts. The development process is incremental, as new activities may happen without disrupting the evolution of the collaboration. The model is therefore an incremental evolutionary spiral in which tasks and activities can coexist simultaneously at some level of detail. As the process moves forward, activities and/or tasks are applied recursively depending on the needs. The evolution of the model is dynamic, and the interaction amongst domain experts and with the model happens all the time. Figure 3 illustrates the model, as well as how processes, activities and tasks are consistent with it.
2.6 CONCLUSIONS
The methodology proposed in this chapter reuses some components that various authors have identified as part of their methodologies. This thesis has investigated how to use these components within decentralised settings such as the biomedical domain. The proposed methodology is consistent with the challenges posed by the ontologies needed for the SW. The importance of this chapter lies in its detailed description of the methods, techniques, activities, and tasks that can be used for developing community-based ontologies. Furthermore, this chapter has also explained the model for the life cycle of these ontologies; both the methodology and the life cycle are consistent with the proposed processes. The fundamental contribution of this chapter is the involvement of communities as both domain
experts and subjects of study. This allowed the author to base his results on real-life cases. Successive chapters present the engineering experiments through which the components presented in this chapter were studied.
2.7 ACKNOWLEDGEMENTS
The author especially thanks Oscar Corcho and Mariano Fernandez for their extremely helpful suggestions.
2.8 REFERENCES
1. Corcho O, Fernandez-Lopez M, Gomez-Perez A: Methodologies, tools, and languages for building ontologies. Where is their meeting point? Data and Knowledge Engineering 2003, 46(1):41-64.
2. Garcia CA, Rocca-Serra P, Stevens R, Taylor C, Nashar K, Ragan MA, Sansone S: The use of concept maps during knowledge elicitation in ontology development processes - the nutrigenomics use case. BMC Bioinformatics 2006, 7:267.
3. Mirzaee V: An Ontological Approach to Representing Historical Knowledge. MSc Thesis. Vancouver: Department of Electrical and Computer Engineering, University of British Columbia; 2004.
4. Bada M, Stevens R, Goble C, Gil Y, Ashburner M, Blake J, Cherry J, Harris M, Lewis S: A short study on the success of the Gene Ontology. Journal of Web Semantics 2004, 1:235-240.
5. Gomez-Perez A: Some Ideas and Examples to Evaluate Ontologies. In: Knowledge Systems, AI Laboratory. Stanford: Stanford University; 1994.
6. Gomez-Perez A, Fernandez-Lopez M, De Vicente A: Towards a Method to Conceptualize Domain Ontologies. In: Workshop on Ontological Engineering, ECAI'96: 1996; Budapest, Hungary; 1996: 41-51.
7. Fernandez M, Gomez-Perez A, Juristo N: METHONTOLOGY: From Ontological Art to Ontological Engineering. In: Workshop on Ontological Engineering, AAAI-97 Spring Symposium Series: 1997; Stanford; 1997.
8. Pinto HS, Martins JP: Ontologies: How can they be built? Knowledge and Information Systems 2004, 6:441-463.
9. Pinto HS, Staab S, Tempich C: DILIGENT: towards a fine-grained methodology for DIstributed, Loosely-controlled and evolvInG engineering of oNTologies. In: European Conference on Artificial Intelligence: 2004; Valencia, Spain; 2004: 393-397.
10. IEEE: IEEE Standard for Software Quality Assurance Plans. IEEE Std 730-1998: IEEE Computer Society; 1998.
11. IEEE: IEEE Standard Glossary of Software Engineering Terminology. IEEE Std 610.12-1990: IEEE; 1991.
12. Greenwood E: Metodologia de la investigacion social. Buenos Aires: Paidos; 1973.
13. Gomez-Perez A, Fernandez-Lopez M, Corcho O: Ontological Engineering. London: Springer-Verlag; 2004.
14. IEEE: IEEE Standard for Developing Software Life Cycle Processes. IEEE Std 1074-1995: IEEE Computer Society; 1996.
15. OBI, Ontology for Biomedical Investigations [http://obi.sourceforge.net/]
16. Microarray Gene Expression Data Society [http://www.mged.org/]
17. Stoeckert CJ, Parkinson H: The MGED ontology: a framework for describing functional genomics experiments. Comparative and Functional Genomics 2003, 4:127-132.
18. Cooke N: Varieties of Knowledge Elicitation Techniques. International Journal of Human-Computer Studies 1994, 41:801-849.
19. Gaines BR, Shaw MLG: Knowledge acquisition tools based on personal construct psychology. The Knowledge Engineering Review 1993, 8(1):49-85.
20. Gruber TR: Toward Principles for the Design of Ontologies Used for Knowledge Sharing. In: International Workshop on Formal Ontology: 1993; Padova, Italy; 1993.
21. Cote R, Jones P, Apweiler R, Hermjakob H: The Ontology Lookup Service, a lightweight cross-platform tool for controlled vocabulary queries. BMC Bioinformatics 2006, 7:97.
22. Noy NF, McGuinness DL: Ontology Development 101: A Guide to Creating Your First Ontology. In: Protege Documentation. Stanford, CA: Stanford University; 2001.
23. Gomez-Perez A, Fernandez-Lopez M, Corcho O: Ontological Engineering: Springer; 2004.
24. Gomez-Perez A, Juristo N, Pazos J: Evaluation and assessment of knowledge sharing technology. In: Towards Very Large Knowledge Bases: Knowledge Building and Knowledge Sharing (KBKS95): 1995; Amsterdam, The Netherlands: IOS Press; 1995: 289-296.
25. Haarslev V, Möller R: Racer: A Core Inference Engine for the Semantic Web. In: Proceedings of the 2nd International Workshop on Evaluation of Ontology-based Tools (EON2003): October 20, 2003; Sanibel Island, Florida, USA; 2003: 27-36.
26. Sirin E, Parsia B, Cuenca-Grau B, Kalyanpur A, Katz Y: Pellet: A practical OWL-DL reasoner. Journal of Web Semantics 2007, 5(2).
27. Eden AH, Hirshfeld Y: Principles in formal specification of object oriented design and architecture. In: Proceedings of the 2001 Conference of the Centre for Advanced Studies on Collaborative Research: 2001; Toronto, Canada: IBM Press; 2001.
28. Pressman RS: Software Engineering: A Practitioner's Approach, Fifth edn: McGraw-Hill; 2001.
29. Kerr J, Hunter R: Inside RAD: McGraw-Hill; 1994.
30. Martin J: Rapid Application Development: Prentice-Hall; 1991.
31. Gilb T: Principles of Software Engineering Management: Addison-Wesley Longman; 1988.
32. Gilb T: Evolutionary Project Management: Multiple Performance, Quality and Cost Metrics for Early and Continuous Stakeholder Value Delivery. In: International Conference on Enterprise Information Systems: April 14-17, 2004; Porto, Portugal; 2004.
33. Sure Y: Methodology, Tools & Case Studies for Ontology based Knowledge Management. Karlsruhe: Universitat Fridericiana zu Karlsruhe; 2003.
34. Fernandez M: Overview of Methodologies for Building Ontologies. In: Proceedings of the IJCAI-99 Workshop on Ontologies and Problem-Solving Methods (KRR5): 1999; Stockholm, Sweden; 1999.
35. Dagnino A: Coordination of hardware manufacturing and software development lifecycles for integrated systems development. In: IEEE International Conference on Systems, Man, and Cybernetics: 2001; 2001: 1850-1855.
36. Boehm B: A spiral model of software development and enhancement. ACM SIGSOFT Software Engineering Notes 1986, 11(4):14-24.
37. McDermid J, Rook P: Software Development Process Models. In: Software Engineer's Reference Book. CRC Press; 1993: 15-28.
38. Larman C, Basili VR: Iterative and Incremental Development: A Brief History. Computer, IEEE Computer Society 2003, 36:47-56.
39. May EL, Zimmer BA: The Evolutionary Development Model for Software. HP Journal 1996: http://www.hpl.hp.com/hpjournal/96aug/aug96a94.htm.
The use of concept maps during knowledge elicitation in ontology development processes

A critical assessment of the state of the art of methodologies for developing ontologies was presented earlier in this thesis, followed by the presentation of the proposed methodology. This methodology is the product of several experiments and analyses addressing some of the previously identified key issues in developing ontologies within communities of practice such as the biological domain. This chapter is divided into two sections. The first presents the process, results and conclusions for one of the experiments upon which the proposed methodology relies. Some specific issues were addressed when conducting this experiment: for instance, how could the knowledge elicitation process be supported throughout the entire process? How could domain experts be engaged in such a way that interaction is facilitated? Which parts of previously proposed methodologies could be applied within this setting? Important information was gathered from this experience: not only were methodological aspects identified, but the importance of conceptual maps was also documented and well established as part of the development process. The second part of this chapter presents another scenario (an ontology for a genealogy management system) in which those identified steps were also evaluated. The contributions of this chapter are the thorough description of the suggested steps when building an ontology, example use of concept maps, consideration of applicability to the development of lower-level ontologies, and application to decentralised environments. Other authors had previously used conceptual maps when eliciting knowledge, but this was the first reported experience of the use of concept maps with the specific aim of developing ontologies. It was also found that, within the specific scenario presented, conceptual maps played an important role in the development process. Another important outcome from this experience was the evidence supporting the importance of communities and how these were
interacting when building ontologies. The author investigated and identified the reusable steps from other methodologies applicable to this specific environment; Alex Garcia also identified and conceptualised the use of conceptual maps when developing ontologies, as well as the different stages within the development process in which conceptual maps could play a role. As the knowledge engineer in charge of this experiment, Alex Garcia could also explore and document the roles of both domain experts and knowledge engineers. Manuscripts leading to the published papers arising from this chapter were written by Alex Garcia.
AUTHORS' CONTRIBUTIONS Susanna Sansone conceived of and coordinated the project. Alex Garcia Castro was a knowledge engineer during his 11-month student project at EBI. Philippe Rocca-Serra coordinated the nutrigenomics community within MGED RSBI, and organised and participated in the knowledge elicitation exercises. Karim Nashar contributed to the knowledge elicitation exercises. Robert Stevens assisted Alex Garcia Castro in conceptualising the methodology, Susanna Sansone and Philippe Rocca-Serra supervised the knowledge elicitation exercises and, with Chris Taylor, the associated meetings. Alex Garcia Castro wrote the initial version of the manuscript; contributions and critical reviews by the other authors, in particular Susanna Sansone and Robert Stevens, delivered the final manuscript.
PUBLISHED PAPER ARISING FROM THIS CHAPTER – FIRST SECTION Garcia Castro A, Rocca-Serra P, Stevens R, Taylor C, Nashar K, Ragan MA, Sansone S: The use of concept maps during knowledge elicitation in ontology development processes - the nutrigenomics use case. BMC Bioinformatics 2006, 7:267.
PUBLISHED PAPER ARISING FROM THIS CHAPTER – SECOND SECTION Garcia Castro A, Sansone S, Rocca-Serra P, Taylor C, Ragan MA: The use of concept maps for two ontology developments: nutrigenomics, and a management system for genealogies. In: 8th International Protégé Conference: 2005; Madrid, Spain; 2005: 59-62.
Chapter III - The use of concept maps during knowledge elicitation in ontology development processes
3.1 THE USE OF CONCEPT MAPS DURING KNOWLEDGE ELICITATION IN ONTOLOGY DEVELOPMENT PROCESSES – THE NUTRIGENOMICS USE CASE
Abstract. Incorporation of ontologies into annotations has enabled ‘semantic integration’ of complex data, making explicit the knowledge within a certain field. One of the major bottlenecks in developing bio-ontologies is the lack of a unified methodology. Different methodologies have been proposed for different scenarios, but there is no agreed-upon standard methodology for building ontologies. The involvement of geographically distributed domain experts, the need for domain experts to lead the design process, the application of the ontologies and the life cycles of bio-ontologies are amongst the features not considered by previously proposed methodologies. Here, we present a methodology for developing ontologies within the biological domain. We describe our scenario, competency questions, results and milestones for each methodological stage. We introduce the use of concept maps during knowledge acquisition phases as a feasible transition between domain expert and knowledge engineer. The contributions of this paper are the thorough description of the steps we suggest when building an ontology, example use of concept maps, consideration of applicability to the development of lower-level ontologies, and application to decentralised environments. We have found that, within our scenario, concept maps played an important role in the development process.
3.1.1 Background

In the field of biological research, recent advances in functional genomics technologies have given the opportunity to carry out complex and possibly high-throughput investigations. Consequently, the storage, management, exchange and description of data in this domain present challenges to biologists and bioinformaticians. It is widely recognised that capturing descriptions of investigations at a high level of granularity is necessary to enable efficient data sharing and meaningful data mining [1, 2]. However, this information is often captured in diverse formats, mostly as free text, and is commonly subject to typographical errors. The increased cost of interpreting the experimental procedures and exploring data has encouraged several scientific communities to develop and adopt ontology-based knowledge representations to extend the power of their computational approaches [3].
Application of an ontologically based approach should be more powerful than simple keyword-based methods for information retrieval. Not only can semantic queries be formed, but axioms that specify relations among concepts can also be provided, making it possible for a user to derive information that has been specified only implicitly. In this way, relevant entries and text can be found even if none of the query words is present (e.g. a query for “furry quadrupeds” might retrieve pages about bears) [4].

Many methodologies for building ontologies have been described [5], and seminal work in the field of anatomy provides insights into how to build a successful ontology [6, 7]. Extensive work on the nature of the relations that can be used also provides solid grounds for the consistent development of ontologies [8]. However, despite these efforts, bio-ontologies still tend to be built on an ad hoc basis rather than by following a well-defined engineering process. To this day, no standard methodology for building ontologies has been agreed upon. Usually terminology is gathered and organised into a taxonomy, from which key concepts are identified and related to create a concrete ontology. Case studies have been described for the development of ontologies in diverse domains, although surprisingly only one of these has been reported to have been applied in a domain allied to bioscience – the chemical ontology [9] – and none in bioscience per se. Most of the literature focuses on issues such as the suitability of particular tools and languages for building ontologies, with little attention being given to how it should be done. This is almost certainly because the main interest has been in reporting content and use, rather than engineering methodology. Nevertheless, it is apparent that most ontologies are built with the ontological equivalent of “hacking”.

A particular lack in these methodologies is support for the continued involvement of domain experts scattered around the world. Biological sciences pose a scenario in which domain experts are geographically distributed, the structure of the ontology is constantly evolving, and the role of the knowledge engineer is not that of the leader but of the one who promotes collaboration and communication among domain experts. Bioinformatics has
demonstrated a need for bio-ontologies, and several characteristics of this domain highlight the lack of support for these requirements:

• The volatility of knowledge in the domain – biologists’ understanding of the domain is in continual flux;
• The domain is large and complex and cannot, therefore, be modelled in one single effort; the knowledge holders are distributed and will not be brought together for frequent knowledge elicitation exercises.

To support these requirements, our methodology pays particular attention to the knowledge elicitation stage of the process of building an ontology. This is the stage where the person managing the development of the ontology gathers, in the form of concepts and relationships between concepts, what the domain expert understands to exist in that domain. To do this, we used concept maps (CMs), a simple graphical representation in which instances and classes are presented as nodes, and relationships between them are shown as arcs [10]. CMs have a simple semantics that appears to be an intuitive form by which domain experts can convey their understanding of a domain. We exploit this feature in order to perform the informal modelling stage of building an ontology.

In support of this argument, we first present a survey of ontology development methodologies, and then report our experience, with particular focus on the how of the initial stages of building an ontology using CMs. We have studied and evaluated the key methodologies and have adapted parts of several of them to produce an overall method, which we describe here as a set of detailed stages that, we argue, can be applied to other domains within the biological sciences. The major contributions of this paper are the thorough description of our methodology for building an ontology (including an examination of the utility of CMs), the consideration of its applicability to the development of ontologies, and the assessment of its suitability for use in decentralised settings. Finally, we discuss the issues raised and draw conclusions.

3.1.1.1 A survey of methodologies

We investigated five methodologies: Enterprise Methodology [11], TOVE (Toronto Virtual Enterprise) [12, 13], the Unified Methodology [14, 15], Diligent [16] and
Methontology [17]. Table 1 presents a summary of our comparison. We analysed these approaches according to the following criteria:

• Accuracy in the description of the stages: we were interested in knowing whether the stages were described in sufficient detail that they could be easily followed.
• Terminology extraction: we wanted to study how terminology extraction could assist knowledge engineers and domain experts when building ontologies. We were interested in those methodologies that could offer some level of support for identifying terms.
• Generality: we needed to know how dependent the investigated methodologies are on a particular intended use. This point was of particular interest since our ontology was intended to serve a particular task. This parameter may be understood as the ability of the method to be applied to a different scenario, or to a different use of the ontology itself.
• Ontology evaluation: we needed to know how we could evaluate the completeness of our ontology. This point was interesting for us since we were working with agreements within the community, and domain experts could therefore agree upon errors in the models.
• Distributed and decentralised: we were interested in those methodologies that could offer support for communities such as ours, in which domain experts were not only geographically distributed but also organised in an atypical manner (i.e. not fully hierarchical).
• Usability: we had a particular interest in those methodologies for which real examples had been reported. Had the methodology been applied to building a real ontology?
• Supporting software: we were interested in knowing whether the methodology was independent of particular software.

We found that only Diligent offered community support for building ontologies, and that none of them had detailed descriptions of knowledge elicitation, nor details of the different steps that had to be undertaken. The methodologies mentioned above have been applied mostly in controlled environments where the ontology is deployed on a one-off basis. Tools, languages and methodologies for building ontologies have been the main research goal for many computer scientists; whereas for the bioinformatics community, they are just one step in the process of developing software to support tasks such as annotation and text mining.
Enterprise Methodology. Description of stages: high-level description of stages. Terminology extraction: N/A. Generality: not domain specific. Ontology evaluation: competency questions. Distributed/decentralised: no. Usability: N/A. Supporting software: N/A.

TOVE Methodology. Description of stages: detail is provided for those ontologies developed with this methodology. Terminology extraction: N/A. Generality: not domain specific. Ontology evaluation: competency questions and formal axioms. Distributed/decentralised: no. Usability: business and foundational ontologies. Supporting software: N/A.

Unified Methodology. Description of stages: high-level description of stages. Terminology extraction: N/A. Generality: not domain specific. Ontology evaluation: no evaluation method is provided. Distributed/decentralised: no. Usability: N/A. Supporting software: N/A.

Methontology. Description of stages: stages are described; more detail is provided for specific developments (the chemical and legal ontologies). Terminology extraction: N/A. Generality: not domain specific. Ontology evaluation: an informal evaluation method is used for the chemical ontology. Distributed/decentralised: no. Usability: chemical ontology, legal ontology. Supporting software: WebODE.

Diligent. Description of stages: high-level description. Terminology extraction: N/A. Generality: not domain specific. Ontology evaluation: the community evaluates the ontology (agreement process). Distributed/decentralised: yes. Usability: N/A. Supporting software: N/A.

Chapter 3 - Table 1. Comparison of methodologies.
Unfortunately, none of the methodologies investigated was designed for the requirements of bioinformatics, nor has any of them been standardised and stabilised long enough to have a significant user community (i.e. large enough for the ontology to have an impact on the community) [18]. Theoretically, the methodologies are independent of the domain and intended use. However, none of them has been used long enough to provide evidence of its generality; they were developed in order to address a specific problem, or as an end in themselves. The evaluation of the ontology remains a difficult issue to address; there is a lack of criteria for evaluating ontologies. Within our particular scenario, the models were being built upon agreements between domain experts. Evaluation was therefore based upon their knowledge and thus could contain “settled” errors. We studied the knowledge elicitation methods described by [19], such as observation,
interviews, process tracing, conceptual methods, and card sorting. Unfortunately, none of them was described within the context of ontology development in a decentralised setting.

We drew parallels between the biological domain and the Semantic Web (SW). This is a vision in which the current, largely human-accessible Web is annotated with ontologies such that the vast content of the Web becomes available to machine processing [20]. Pinto and co-workers [21] define these scenarios as distributed, loosely controlled and evolving. Domain experts in biological sciences are rarely in one place; they tend to form virtual organisations where experts with different but complementary skills collaborate in building an ontology for a specific purpose. The structure of the collaboration does not necessarily have a central control; different domain experts join and leave the network at any time and decide on the scope of their contribution to the joint effort. Biological ontologies are constantly evolving, not only as new instances are added, but also as new whole/part-of properties are identified as new uses of the ontology are investigated. The rapid evolution of biological ontologies is due in part to the fact that ontology builders are also those who will ultimately use the ontology [22].

Some of the differences between classic proposals from Knowledge Engineering (KE) and the requirements of the SW have been presented by Pinto and co-workers [21], who summarise these differences in four key points:

1. Distributed information processing with ontologies: within the SW scenario, ontologies are developed by geographically distributed domain experts willing to collaborate, whereas KE deals with centrally-developed ontologies.
2. Domain expert-centric design: within the SW scenario, domain experts guide the effort while the knowledge engineer assists them. There is a clear and dynamic separation between the domain of knowledge and the operational domain. In contrast, traditional KE approaches relegate the expert to the role of an informant to the knowledge engineer.
3. Ontologies are in constant evolution in the SW, whereas in KE scenarios, ontologies are simply developed and deployed.
4. Additionally, within the SW scenario, fine-grained guidance should be provided by the knowledge engineer to the domain experts.

We consider these four points to be applicable within biological domains, where domain experts have crafted ontologies, taken care of their evolution, and defined their ultimate use. Our proposed methodology takes into account all the considerations reported by Pinto and co-workers [21], as well as those previously studied by the knowledge representation community.

3.1.2 Methods

3.1.2.1 General view of our methodology

A key feature of our methodology is the use of CMs throughout our knowledge elicitation process. CMs are graphs consisting of nodes representing concepts, connected by arcs representing the relationships between those nodes [23]. Nodes are labelled with text describing the concept that they represent, and the arcs are labelled (sometimes only implicitly) with a relationship type. Within our development, CMs proved useful both for sharing and capturing activities, and in the formalisation of use cases. Figure 1 illustrates a CM. Our methodology strongly emphasises: (i) capturing knowledge, (ii) sharing knowledge, (iii) supporting needs with well-structured use cases, and (iv) supporting collaboration in distributed (decentralised) environments. Figure 2 presents the steps and milestones that we envisage to occur during our ontology development process.
Chapter 3 - Figure 1. View of a concept map. Adapted with permission from: http://cmap.coginst.uwf.edu/info/
Chapter 3 - Figure 2. Steps (1-6) and milestones (boxes).
Step 1: The first step involves addressing straightforward questions such as: what is the ontology going to be used for? How is the ontology ultimately going to be used by the
software implementation? What do we want the ontology to be aware of, and what is the scope of the knowledge we want to have in the ontology?

Step 2: When identifying reusable ontologies, it is important to focus on what any particular concept is used for, how it impacts on and relates to other concepts, how it is embedded within the process to which it is relevant, and how domain experts understand it. It is not important to identify exact linguistic matches. By recyclability of different ontologies, we do not imply that we can indicate which other ontology should be used in a particular area or problem; instead, we mean conceptually how and when one can extrapolate from one context to another. Extrapolating from one context to another largely depends on the agreement of the community, and on the specific conditions of the contexts involved. Indicating where another ontology should be used to harmonise the representation at hand – for example, between geographical ontologies and the NCBI (National Center for Biotechnology Information) taxonomy – is a different issue that we refer to as reusability.

Step 3: Domain analysis and knowledge acquisition are processes by which the information used in a particular domain is identified, captured and organised for the purpose of making it available in an ontology. This step may be seen as the ‘art of questioning’, since ultimately all relevant knowledge is either directly or indirectly in the heads of domain experts. This step involves the definition of the terminology, i.e. the linguistic phase. It starts with the identification of reusable ontologies and terminates with the baseline ontology, i.e. a draft version containing few but seminal elements of an ontology. We found it important to maintain the following criteria during knowledge acquisition:

• Accuracy in the definition of terms: the linguistic part of our development was also meant to support the sharing of information/knowledge. Table 2 presents the structure of our linguistic definitions. The availability of context as part of the definition proved to be useful when sharing knowledge.
• Coherence: as CMs were being enriched, it was important to ensure the coherence of the story we were capturing. Domain experts were asked to use the CMs as a means to tell a story; consistency within the narration was therefore crucial.
• Extensibility: our approach may be seen as an aggregation problem; CMs were constantly gaining information, which was always part of a bigger narration. Extending the conceptual model was not only about adding more details to the
existing CMs, nor was it just about generating new CMs; it was also about grouping concepts into higher-level abstractions and validating these with domain experts. Scaling the models involved the participation of both domain experts and the knowledge engineer. It was mostly done by direct interview and confrontation with the models from different perspectives. The participation of new “fresh” domain experts, as well as the intervention of experts from allied domains, allowed us to analyse the models from different angles. This participatory process allowed us to re-factorise the models by increasing the level of abstraction.

Word: Investigation. Verb/Noun: Noun. Definition: An Investigation is a set, a collection of related studies and assays; a self-contained unit of scientific enquiry. Context: Evaluating the effect of an ingredient in a diet traditionally relies on one or more related studies, for example where the subjects receive different concentrations of the ingredient. The concept of investigation provides a container that allows us to group these studies together. Notes: When can we consider an investigation completed? Ongoing discussion. For instance, according to the Minimal Information About a Microarray Experiment (MIAME), an Experiment is a set of hybridisations that are in some way related (e.g. related to the same publication). In the case of the Investigation, we do not want to tie this concept to a publication, a deposition to a database, or a submission to a regulatory authority. The decision should be left to the individual investigator.

Chapter 3 - Table 2. Example of the structure of linguistic definitions.
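To make the structure of such a record explicit, the following minimal sketch renders a linguistic definition as a small data structure; the class and field names are our own illustration, not part of any published schema:

    from dataclasses import dataclass

    @dataclass
    class LinguisticDefinition:
        word: str        # the term being defined, e.g. "Investigation"
        word_class: str  # "Noun" or "Verb"
        definition: str  # the agreed textual definition
        context: str     # usage context that supports knowledge sharing
        notes: str       # open issues and ongoing discussion points

    investigation = LinguisticDefinition(
        word="Investigation",
        word_class="Noun",
        definition="A self-contained collection of related studies and assays.",
        context="Groups related studies, e.g. diet studies in which subjects "
                "receive different concentrations of an ingredient.",
        notes="When can an investigation be considered completed? "
              "Ongoing discussion.",
    )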
The goal determines the complexity of the process. Creating an ontology intended only to provide a basic understanding of a domain may require less effort than creating one intended to support formal logical arguments and proofs in a domain. We must answer questions such as: Why are we building this ontology? What do we want to use it for? How is it going to be used by the software layer? The subsections from “Identification of purpose, scope, competency questions and scenarios” to “Iterative building of informal ontology models” explain these steps in detail.

Step 4: Iterative building of informal ontology models helped to expand our glossary of terms and relations, their definition or meaning, and additional information such as examples to clarify the meaning where appropriate. Different models were built and validated with the domain experts.
Step 5: Formalisation of the ontology was the step during which the classes were constrained, and instances were attached to their corresponding classes. For example: “a male is constrained to be an animal with a Y chromosome” (a minimal description-logic rendering of this constraint is sketched after step 6). This step involves the use of an ontology editor.

Step 6: There is no unified framework to evaluate ontologies, and this remains an active field of research. We consider that ontologies should be evaluated according to their fitness for purpose, i.e. an ontology developed for annotation purposes should be evaluated by the quality of the annotation and the usability of the annotation software. By the same token, the recall and precision of the data, and the usability of the conceptual query builder, should form the basis of the evaluation of an ontology designed to enable data retrieval.
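As a minimal sketch of the kind of constraint formalised in step 5, the example above could be written in description-logic notation as follows; the names hasChromosome and YChromosome are illustrative assumptions, not terms from our ontology:

    \mathit{Male} \equiv \mathit{Animal} \sqcap \exists\,\mathit{hasChromosome}.\mathit{YChromosome}

In an editor such as Protégé, this corresponds to declaring Male as a defined class with the necessary and sufficient condition built from the class Animal and an existential restriction on the hasChromosome property.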
3.1.2.2 Scenarios and ontology development process
The methodology we report herein has been applied during the knowledge elicitation phase with the European nutrigenomics community (NuGO) [24]. Nutrigenomics is the study of the response of a genome to nutrients, using “omics” technologies such as genome-scale mRNA expression (transcriptomics), cell- and tissue-wide protein expression (proteomics), and metabolite profiling (metabolomics) in combination with conventional methods. NuGO includes twenty-two partner organisations from ten European countries, and aims to develop and integrate all facets of resources, thereby making future nutrigenomics research easier. An ontology for nutrigenomics investigations would be one of these resources, designed to provide semantics for those descriptors relevant to the interpretation and analysis of the data. When developing an ontology involving geographically distributed domain experts, as in our case, the domain analysis and knowledge acquisition phases may become a bottleneck due to difficulties in establishing a formal means of communication (i.e. in sharing knowledge). Additionally, the NuGO participants collaborate with international toxicogenomics and environmental genomics communities under the RSBI (Reporting Structure for Biological Investigations) [25], a working group of the Microarray Gene Expression Data (MGED)
Society. One of the objectives of RSBI is the development of a common high-level abstraction defining the semantic and syntactic scaffold of a record/document that describes an investigation in these diverse biological domains. The RSBI groups will validate the high-level abstraction against complex use cases from their domain communities, ultimately contributing to the Functional Genomics Ontology (FuGO), a large international collaborative development project [26]. Application of our methodology in this context, with geographically distributed groups, has allowed us to examine its applicability and understand the suitability of some of the tools currently available for collaborative ontology development.

3.1.2.2.1 Identification of purpose, scope, competency questions and scenarios
Whilst the high-level framework of the nutrigenomics ontology will be built as a collaborative effort with the other MGED RSBI groups, the lower-level framework aims to provide semantics for those descriptors specific to the nutritional domain. Having defined the scope of the ontology, we discussed the competency questions with our nutrigenomics researchers (henceforth our domain experts); these were used at a later stage to help evaluate our model. Examples of these competency questions are presented in Table 3.

• Which investigations were done with a high-fat-diet study?
• Which study employs microarray in combination with metabolomics technologies?
• List those studies in which the fasting phase had a duration of one day.

Chapter 3 - Table 3. Examples of competency questions.
Competency questions are understood here as those questions for which we want the ontology to be able to provide support for reasoning and inference processes. We consider that ontologies do not answer questions, although they may provide support for reasoning processes. Domain experts should express the competency questions in natural language without any constraint.
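A toy illustration of the kind of reasoning support meant here, using the Python rdflib and owlrl libraries; the namespace and the class names HighFatDietStudy and DietStudy are invented for the example and are not terms from the ontology described in this chapter:

    from rdflib import Graph, Namespace, RDF
    from rdflib.namespace import RDFS
    from owlrl import DeductiveClosure, RDFS_Semantics

    EX = Namespace("http://example.org/nugo#")  # hypothetical namespace
    g = Graph()

    # Asserted knowledge: one subclass axiom and one typed individual.
    g.add((EX.HighFatDietStudy, RDFS.subClassOf, EX.DietStudy))
    g.add((EX.study42, RDF.type, EX.HighFatDietStudy))

    # Materialise the RDFS entailments (subclass reasoning).
    DeductiveClosure(RDFS_Semantics).expand(g)

    # study42 is now retrievable as a DietStudy, although that triple
    # was never asserted explicitly.
    print((EX.study42, RDF.type, EX.DietStudy) in g)  # True

This is the sense in which an ontology supports a competency question such as the first one in Table 3: the class hierarchy and relations make implicitly stated facts retrievable.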
3.1.2.2.2 Identification of reusable and recyclable ontologies
For our particular purposes, we followed a ‘top-down’ approach where experts in the biological domain work together to identify key concepts, then postulate and capture an initial high-level ontology. We decided to follow this approach because of the availability of high-level domain experts who could provide a more general picture. We identified, for example, the Microarray Gene Expression Data (MGED) Ontology (henceforth, MO) [27] as a possible ontology from which we could recycle (extrapolate from one context to another) some terms and/or structure for investigations employing other omics technologies in addition to expression microarrays. The Open Biomedical Ontologies project (OBO) [28, 29] was an invaluable source of information for the identification of possible orthogonal ontologies. Domain experts and the knowledge engineer worked together in this task; in our scenario, it was a process in which we focused on those high-level concepts that were part of MO and relevant to the description of a complete investigation. We also studied the structure that MO proposes, and by doing so came to appreciate that some concepts could be linguistically different but in essence mean very similar things. This is an iterative process currently done as part of the FuGO project. FuGO will expand the scope of MO, drawing in large numbers of experimentalists and developers, and will draw upon the domain-specific knowledge of a wide range of biological and technical experts.
3.1.2.2.3 Domain analysis and knowledge acquisition
We hosted a series of meetings during which the domain experts discussed the terminology and structure used to describe nutrigenomics investigations. For us, domain analysis is an iterative process that must take place at every stage of the development process. We focused our discussions on specific descriptions about what the ontology should support, and sketched the planned area in which the ontology would be applied. Our goal was also to guide the knowledge engineer and involve that person in a more direct manner. An important outcome from this phase was an initial consensus reached on those terms that could potentially have a meaning for our intended users. The main aim of these informal
linguistic models was to build an explanatory dictionary; some basic relations were also established between concepts. We decided to use two separate tools (Protégé [30] and CMAP-tools [10]) because none of the existing Protégé plug-ins provided direct manipulation capabilities over the concepts and the relations among them the way CMAP-tools does. Additionally, we studied different elicitation experiences with CMs, such as [31, 32]. Our knowledge formalism was Description Logic (DL); we used the Protégé OWL plug-in. CMs were used in two stages of our process: capturing knowledge, and testing the representation. Initially we started to work with informal CMs; although these are not computationally enabled, for a human they appear to have greater utility than other forms of knowledge representation such as spreadsheets or word processor tables. As the model gained semantic richness through the formalisation of ‘is-a’ and ‘whole/part-of’ relationships between the concepts, the CMs evolved and became more complex. Using CMs, our domain experts were able to identify and represent concepts, and to declare relations among them. We used CMAP-tools version 3.8 [10] as a CM editor.
3.1.2.2.3.1 Attributes of the domain experts
Experts should of course be highly knowledgeable in their respective areas. We identified two kinds of nutrigenomics experts: high-level experts, scientists at a project coordination level involved in interdisciplinary efforts; and domain-specific experts, experimentalists at a more technical level with extensive hands-on experience. When developing an ontology, it is also important to have experts with broad vision, so that the flow of information can be captured and specific controlled vocabularies properly identified.
3.1.2.2.3.2 The knowledge elicitation sessions
The goal of these sessions was to identify both the high-level and low-level domain concepts, why these concepts were needed, and how they could be related. A secondary goal was to identify reusable ontologies where possible. In the first sessions, it was important to see clearly the ‘what went where’, as well as the structure of the relationships that ‘glued’ the information together. We were basically working
with informal artefacts (CMs, word processor documents, spreadsheets and drawings); it was only at a later stage that we achieved some formalisation. Some sessions took place by teleconference; these were supported by iterative use of WEBEX (web, video, and teleconferencing software) [33] and Protégé. CMs were also used to present structural aspects of the concepts. We found it important to set specific goals for each teleconference, with these goals ideally specified as questions distributed prior to the meeting. In our case, most of the teleconferences focused on specific concepts, with questions of the form “how does A relate to B?”, “why do we need A here instead of B?”, and “how does A impact on B?”. Cardinality issues were also discussed.
3.1.2.2.3.3 Representing conceptual queries
We also used CMs to represent conceptual queries. We observed that domain experts are used to querying information systems using keywords, rather than building structured queries. In formalising the conceptual queries, CMs provided the domain experts with a tool that allowed them to go from an instance to the appropriate class/concept, at the same time identifying the relationships. For example, within the nutrigenomics domain some investigations study the health status of human volunteers by looking at the level of zinc in their hair. These investigations may take place in different research institutes, but all the information may be stored in just one central repository. In order to correlate all those investigations, the researcher should be able to formulate a simple query: “what is the zinc concentration in hair across three different ethnic groups?”. Figure 3 illustrates this query. Conceptually, this query relates compounds, health function and ethnicity. The concept of compound implies a measurement; by the same token, the concept of health function implies a particular part of the organism.
Chapter 3 - Figure 3. CMs as a means to structure a conceptual query.
Conceptual queries are based on high-level abstractions, relationships between concepts, concept-instances and logical operators; the selection of a high-level abstraction allows the class to be instantiated. Conceptual queries provide a level of interaction between the user and the external sources, removing the need for the user to be aware of the schema. We do not want only to guide the user by allowing him/her to select concepts; we would also like to prompt the user in a consistent and coherent way, so that the user can constrain the query before execution takes place, and/or navigate intelligently across terms. Thus we see why, ultimately, we need an ontology and not simply a controlled vocabulary or a dictionary of terms. Controlled vocabularies per se describe neither relations among entities nor relations among concepts, and consequently cannot support inference processes [4]. The collected competency questions could be used as a starting point for building the conceptual queries. Competency questions are informal, whereas conceptual queries are used to identify the ‘class-relation-instance’ and thus improve the understanding of how users may ultimately query the system. Conceptual queries may be understood as a formalisation of competency questions.
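To give a concrete flavour of such a formalisation, the zinc query of Figure 3 could be rendered as a structured query over an RDF store along the following lines; the file name, the namespace and the property names (ethnicity, onSubject, compound, bodyPart, concentration) are illustrative assumptions rather than terms from our ontology:

    from rdflib import Graph

    g = Graph()
    g.parse("nutrigenomics_investigations.rdf")  # hypothetical data file

    # "What is the zinc concentration in hair across different ethnic groups?"
    query = """
    PREFIX nugo: <http://example.org/nugo#>
    SELECT ?ethnicGroup ?concentration
    WHERE {
        ?subject nugo:ethnicity ?ethnicGroup .
        ?measurement nugo:onSubject ?subject ;
                     nugo:compound nugo:Zinc ;
                     nugo:bodyPart nugo:Hair ;
                     nugo:concentration ?concentration .
    }
    """
    for row in g.query(query):
        print(row.ethnicGroup, row.concentration)

The point is not the syntax but the shape: the keywords of the competency question become classes, instances and relationships, which is exactly the ‘class-relation-instance’ structure the CMs helped our domain experts to articulate.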
3.1.2.2.4 Iterative building of informal ontology models
Domain experts represented their knowledge in the different CMs they were generating. Their representation was very specific; they were providing instances and relating
these instances with very detailed whole/part-of relations. Figure 4 presents an example from the nutrigenomics domain that illustrates how we used the CMs to move from instances to classes, to identify is_a relations, and to define the whole/part-of relationship more precisely.
Chapter 3 - Figure 4. Elicitation of Is_a, whole/part-of, and classes.
Initially, domain experts represented specific cases with instances rather than classes. The specificity of the use cases made it easy to identify a subject-predicate structure in which subjects could be assimilated to instances. The predicates, in most cases, carried relations and/or information pointing to other ontologies that were needed. Subjects were understood as those entities that perform an action or receive the action, whereas the predicate contains whatever may be said about the subject. By gathering use cases in the form of CMs, we could identify the classes and subclasses, for example: beverage is_a food, juice is_a non-alcoholic beverage. The has_attribute/is_attribute_of property attached to the instance was also discussed. Moving from instances to classes was an iterative process in which domain experts were representing their knowledge by providing a narration full of instances, specific properties, and relationships. The knowledge engineer analysed all the material. By doing so, the different levels of abstraction that could be used in
order to group those instances were identified; ultimately, domain experts validated this analysis.
3.1.3 Future work

As the nutrigenomics work contributes to the development of FuGO, the final steps (formalisation and evaluation) will be possible only at a later stage, after our results (e.g. new concepts and/or structures) are evaluated and integrated into the structure of the functional genomics investigation ontology. However, we will continue to evaluate our framework with our nutrigenomics users and the other RSBI groups, to see if it accurately captures the information we need, and if our terminology and definitions are sufficiently clear to assist the annotation process.
3.1.3.1 Formalisation
Moving from informal models to formal models with accurate is-a and whole/part-of relationships will be done using Protégé. FuGO will also be developed in Protégé because it has strong community support, multiple visualisation facilities, and can export the ontology in different formats (e.g. OWL, RDF, XML, and HTML). Partly because Protégé and CMAP-tools are not currently integrated, and partly because they aim to assist different stages during the process of developing an ontology, this has to be done mostly by hand. We envisage that integration of these two tools may help knowledge engineers in this process; semi-automated translation from CMs into OWL structures, through the provision of assistance that allows developers to formally encode bio-ontologies, would be desirable. Hayes and co-workers [34] addressed the problem of moving from CMs into OWL models. They extend CMAP-tools so that it supports import and export of machine-interpretable knowledge formats such as OWL. Their approach assumes that the construction of the ontology starts from the CM and that the CM evolves naturally into the ontology. This makes it difficult for large ontologies, where several CMs shape only a part of the whole ontology. Furthermore, adding asserted conditions (such as necessary, and necessary and sufficient) was not possible; formalisation involves the encoding of the CM into a valid OWL structure by
identifying and properly declaring classes and properties. Based on those experiences in which we have used CMs, we are designing a tool that supports such a transition. Difficulties arise from the divergence of syntactic formats between CMs and OWL models: CMs do not have logical constraints, whereas OWL structures are partially supported by them; the lack of connection between concepts as understood in CMs and OWL classes should also be noted. During the elicitation process, the information gathered by means of CMs was usually incomplete, in the sense that it tended to be too narrow – meaningful only within the context of a particular researcher. Moreover, CMs initially pictured processes; at later stages, as they gained specificity, the identification of terms and relationships was enriched. All of this adds to the difference between the information one can gather in a CM and an OWL model; it also emphasises the complementary relationship between the two. The node-arc-node structure of a CM may be assimilated to an RDF representation, as well as to an embryonic OWL model (a minimal sketch of this correspondence is given at the end of this subsection). The proximity between CMs and OWL models allows the arrangement of a CM directly into the syntactic structure of an OWL file, thus avoiding some of the inconveniences of translations between non-related models. The transition from a CM model to an OWL model may be made easier by allowing domain experts to develop parts of the ontology with the assistance of knowledge engineers. The assistance of the knowledge engineer should focus on the consistency of the whole/part-of properties in order to ensure orthogonality. Domain experts express in their CMs their different views of the world; the fragmentation of the domain of knowledge is mostly done by means of the is-a relationship and whole/part-of properties. Once these properties and relationships are properly defined, combining complementary CMs may be much easier; by doing so, the consistency of the OWL model may also be assured. It will not be only by integrating CM functionality into Protégé that the knowledge acquisition process will be better supported and the formalisation/encoding of ontologies might be achieved more rapidly; it is also important to harmonise CMs and OWL models syntactically and semantically. The construction of the class hierarchy should be done
in parallel with the definition of its properties. This will allow us to identify potential redundancies and inconsistencies in the ontology. Domain analysis will thus be present throughout the whole development process.
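The correspondence mentioned above can be sketched in a few lines of Python using the rdflib library; the namespace, the arc list and the mapping policy are our own illustration of the idea, not the tool under development:

    from rdflib import Graph, Namespace
    from rdflib.namespace import RDF, RDFS, OWL

    CM = Namespace("http://example.org/cm#")  # hypothetical namespace

    # Each concept-map arc is a (node, link label, node) triple.
    arcs = [
        ("Beverage", "is_a", "Food"),
        ("Juice", "is_a", "Beverage"),
        ("Study", "part_of", "Investigation"),
    ]

    g = Graph()
    g.bind("cm", CM)
    for subject, label, obj in arcs:
        s, o = CM[subject], CM[obj]
        g.add((s, RDF.type, OWL.Class))
        g.add((o, RDF.type, OWL.Class))
        if label == "is_a":
            g.add((s, RDFS.subClassOf, o))  # is_a arcs become subclass axioms
        else:
            # Other labelled arcs become object properties; turning them into
            # proper OWL restrictions still requires a modelling decision.
            p = CM[label]
            g.add((p, RDF.type, OWL.ObjectProperty))
            g.add((s, p, o))

    print(g.serialize(format="pretty-xml"))  # an embryonic OWL/RDF-XML file

The untyped arcs are precisely where the knowledge engineer's assistance is needed: a script can emit triples, but deciding whether an arc such as part_of should become an existential restriction, a necessary condition, or something else is a modelling choice that cannot be automated away.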
3.1.3.2 Evaluation
Before putting the ontology into use, we will need to evaluate how accurately it can answer our competency questions and conceptual queries. To accomplish this, we will use CMs as well as some functionalities included in Protégé. Because our CMs represent the conceptual scaffold of the knowledge we are representing, we will use them to evaluate how this discourse may be mapped into the concepts and relationships we have captured. The rationale behind this is simple: the concepts and relationships, if accurate, may then be mapped into the actual discourse. By doing this we hope to identify:

• where the concepts are not linguistically clear;
• whether any redundancies are present;
• whether the process has been accurately represented both syntactically and semantically.

We envisage a simple structure for our validation sessions: domain experts will be presented with the CM and asked to map their narration into that CM. Minimal or no help should then be given to the domain expert. The use of CMs as a narrative tool for the evaluation of ontologies has not, to our knowledge, been reported previously. Further research into this particular application of CMs may be valuable. Ultimately, the ontology may also be evaluated by using the PAL (Protégé Axiom Language) plug-in provided by Protégé; PAL allows the construction of more sophisticated queries. Among the methods described by [35], we checked consistency using only RACER [36].
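In this work the consistency check was run through RACER from within Protégé; purely as an illustration of what such a check involves, an equivalent step can be scripted, for instance with the owlready2 library (a hedged sketch: the file path is a placeholder, and owlready2 was not the tool used here):

    from owlready2 import (get_ontology, sync_reasoner,
                           OwlReadyInconsistentOntologyError)

    # Load a (hypothetical) OWL file exported from the ontology editor.
    onto = get_ontology("file:///path/to/nutrigenomics.owl").load()

    try:
        with onto:
            sync_reasoner()  # runs the bundled HermiT reasoner
        print("Ontology is consistent.")
    except OwlReadyInconsistentOntologyError:
        print("Ontology is inconsistent; review the asserted axioms.")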
3.1.4 Discussion

Building ontologies is a non-trivial task that depends heavily on domain experts. The methodology presented in this paper may be used in different domains with scenarios similar to ours. We used concept maps at different stages during this process, and in different ways. The beauty of CMs is that they are informal artefacts; introducing formal semantics into them remains a matter for further investigation. The translation from CMs to OWL remains manual, and we acknowledge that some information may be lost, or even created, in this step despite the constant participation of domain experts. An ideal ontology development tool would assist users not only during knowledge elicitation, as CMAP-tools does well, but also during the formalisation process, so that everything could be done within one software tool.

On the ‘art of questioning’: when to ask? How to ask? How to intervene in a discussion without taking sides? These are some of the considerations the elicitor must bear in mind during the sessions. When to ask? Basically, he/she should ask only when the discussion is not heading in the direction of answering the stated question. How to ask? The question may be stated as a direct question, or as a hypothesis of the form ‘if A happens, then what happens to B?’, ‘what is the relationship between A and B?’, ‘what implications may A have for B?’. The knowledge engineer should ideally intervene in discussions as little as possible. The experts are presented with an initial scenario or question, after which their discussion takes place so that knowledge can start to be elicited. CMs proved to be a very powerful tool for constraining the discussions in a consistent way.

Unfortunately, too little attention has been paid in the bio-ontological literature to the nature of such relations and of the relata that they join together [8]. This is especially true for ontologies about processes. OBO provides a set of guidelines for structuring the relationships, as well as for building the actual ontology. We are considering these and will follow these guiding principles in our future development. We will also consider the issue of orthogonality very carefully, as we have always thought about those ontologies that could, at a later stage, be integrated into our proposed structure.
Currently, knowledge is commonly exchanged via email, WIKI pages and teleconferences. While this may still work for closely related groups, or when working within a well-defined domain, we have demonstrated in this paper that CMs can effectively assist both domain experts and the knowledge engineer, and provide a basis for properly visualising the argument and its follow-ups. Tempich and co-workers addressed some of these issues by proposing an argumentation ontology for distributed, loosely-controlled and evolving engineering processes [16, 37].

The development of an ontology for Genealogy Management Systems (GMS) was another scenario in which our methodology was applied during the knowledge elicitation process [38]. This was a slightly different scenario because our domain experts were mostly in one place. The GMS ontology is meant to partially support the annotation of germplasm throughout the entire transformation process that takes place in several research institutes. CMs were here initially used to represent those different transformation processes; at a later stage CMs, in combination with semi-automatic terminology extraction algorithms, were also used to capture and organise vocabulary. The combination of CMs and these semi-automatic methods for terminology extraction proved to be quite useful; initially domain experts were presented with lists of terms, and were later requested to organise them using CMs. During the development of the GMS ontology, a narrative approach was also investigated in conjunction with semi-automatic text extraction methods. The approach taken was simple: domain experts were asked to build stories as they provided vocabulary. Empirical evidence from this experience suggests that CMs may provide us with a framework for larger terminology extraction and validation efforts. A paper describing these experiences is in preparation.

Despite the differences between those domains, the CMs proved to be useful when capturing and sharing knowledge, both as an external representation of the topic being discussed and as an organisational method for knowledge elicitation. It should be noted, however, that only time will tell about the transposability of this methodology into other domains.
3.1.5 Conclusions

We have focused our efforts on knowledge elicitation within the nutrigenomics community. We present a methodology for building ontologies and report our experiences during the knowledge elicitation phase in particular. An informal evaluation of the knowledge elicitation sessions suggests strong commonalities with the argumentative structure proposed by several authors [21, 16, 37]. We identify the need for further research on how to manage this arrangement. For instance, it could be desirable to track discussions in a more structured and conceptual manner rather than browsing through a vast set of emails. The structure of discussions over ontologies may follow a pattern. We consider that structuring discussions requires technology able to provide some cognitive support to users, not only to post their comments but also to follow and search the threads.

Having provided evidence for the applicability of our methodology, it would be interesting to see how it can be extended and better supported by software tools such as Protégé. General-purpose collaborative development environments focus more on technical aspects such as consistency and version control than on the actual act of collaboration. Collaborative environments such as WIKIs or version-control software (e.g. configuration management software) do not support ontology development in any special way. Recent developments of Protégé, such as those proposed by [39] and [19], are an interesting step in the right direction; however, too little attention has been paid to the actual process of collaboration when building ontologies within decentralised environments. Diaz and co-workers [39] have developed a tool that provides some extended multi-user capability, sessions, and a versioning control system. Building ontologies in which domain experts are informants and, at the same time, leaders of the process is, however, a more complex process that requires more than just a tool in which different users may edit and work on the same file. Hayes and collaborators [19] provide an extension to CMAP-tools in which CMs may be saved as an OWL file. However, it proved to be difficult to read these files in Protégé due to some inconsistencies in the generated OWL structure; unfortunately this extension does not provide a way in which it is possible to fully exploit DL.
Both Hayes [34] and Diaz [39] propose interesting solutions. However, we consider that collaboration emerges naturally when domain experts are provided with tools that allow them to represent and share their knowledge in such a way that it is easy to promote and support discussion and to concentrate on concepts and constraints. There is a need to support collaborative work from the perspective of allowing users to make use of a virtual working place; cognitive support is therefore needed. The design and development of such a collaborative environment, and of an accompanying CM plug-in for Protégé that supports both the knowledge acquisition phase and the translation from the CM to an OWL structure, are clearly desirable. The development of this plug-in, as well as of a more comprehensive collaborative environment, is currently in progress. Ontologies are constantly evolving, and the conceptual structures should be flexible enough to allow this dynamic. It is important to report methodological issues (or just “methodology”) as part of those papers presenting ontologies, in a section analogous to the “methods and materials” sections required in experimental papers. The added clarity and rigour that such presentation would bring would help the community extend and better adapt existing methodologies, including the one we describe here.
We gratefully acknowledge our early discussions with Jennifer Fostel and Norman Morrison, leaders of the toxicogenomics and environmental genomics communities within MGED RSBI. We thank Ruan Elliot (Institute of Food Research) and Anne-Marie Minihane (Reading University) for their expertise in nutritional science. We also acknowledge Mark Wilkinson, Oscar Corcho, Benjamin Good, and Sue Robathan for their comments. Finally, we thank Mark Green (EBI) for his constant support. This work was partly supported by the student exchange grants of the EU Network of Excellence NuGO (NoE 503630) to SAS, the EU Network of Excellence Semantic Interoperability and Data Mining in Biomedicine (NoE 507505) to RS, and an Australian Research Council grant (CE0348221) to MAR.
3.1.7 References
1. Quackenbush J: Data standards for 'omic' science. Nature Biotechnology 2004, 22:613-614.
2. Field D, Sansone SA: A special issue on data standards. OMICS: A Journal of Integrative Biology 2006 (in press).
3. Blake J: Bio-ontologies—fast and furious. Nature Biotechnology 2004, 22:773-774.
4. Garcia Castro A, Chen YP, Ragan MA: Information integration in molecular bioscience: a review. Applied Bioinformatics 2005, 4(3):157-173.
5. Corcho O, Fernandez-Lopez M, Gomez-Perez A: Methodologies, tools, and languages for building ontologies. Where is their meeting point? Data and Knowledge Engineering 2002, 46(1):41-64.
6. Smith B, Rosse C: The Role of Foundational Relations in the Alignment of Biomedical Ontologies. Amsterdam: IOS Press; 2004.
7. Rosse C, Kumar A, Mejino J, Cook D, Detwiler L, Smith B: A Strategy for Improving and Integrating Biomedical Ontologies. In: American Medical Informatics Association 2005 Symposium: 2005; Washington DC; 2005: 639-643.
8. Smith B, Ceusters W, Klagges B, Köhler J, Kumar A, Lomax J, Mungall C, Neuhaus F, Rector AL, Rosse C: Relations in Biomedical Ontologies. Genome Biology 2005, 6(5):R46.
9. Lopez F, Perez G, Sierra J, Pazos S: Building a Chemical Ontology Using Methontology and the Ontology Design Environment. IEEE Intelligent Systems & Their Applications 1999, 14(1):37-46.
10. CmapTools [http://cmap.ihmc.us/]
11. Uschold M, King M: Towards a Methodology for Building Ontologies. In: Workshop on Basic Ontological Issues in Knowledge Sharing, held in conjunction with IJCAI-95: 1995; Cambridge, UK; 1995.
12. Fox M: The TOVE Project: A Common-sense Model of the Enterprise. In: Industrial and Engineering Applications of Artificial Intelligence and Expert Systems: 1992; Springer-Verlag; 1992: 25-34.
13. Gruninger M, Fox MS: The Design and Evaluation of Ontologies for Enterprise Modelling. In: Workshop on Implemented Ontologies, European Workshop on Artificial Intelligence: 1994; Amsterdam, NL; 1994.
14. Uschold M: Building Ontologies: Towards a Unified Methodology. In: 16th Annual Conference of the British Computer Society Specialist Group on Expert Systems: 1996; Cambridge, UK; 1996.
15. Uschold M, Gruninger M: Ontologies: Principles, methods and applications. Knowledge Engineering Review 1996, 11(2):93-136.
16. Vrandecic D, Pinto H, Sure Y, Tempich C: The DILIGENT Knowledge Processes. Journal of Knowledge Management 2005, 9(5):85-96.
17. Fernández M, Gómez-Pérez A, Juristo N: METHONTOLOGY: From Ontological Art to Ontological Engineering. In: Workshop on Ontological Engineering, Spring Symposium Series, AAAI-97: 1997; Stanford; 1997.
18. Beck H, Pinto HS: Overview of Approach, Methodologies, Standards, and Tools for Ontologies. The Agricultural Ontology Service (UN FAO) 2003.
19. Hayes P, Eskridge CT, Saavedra R, Reichherzer T, Mehrotra M, Bobrovnikoff D: Collaborative Knowledge Capture in Ontologies. In: K-CAP 05: 2005; Banff, Canada; 2005.
20. Berners-Lee T: Weaving the Web. HarperCollins; 1999.
21. Pinto H, Staab S, Tempich C: Diligent: towards a fine-grained methodology for Distributed, Loosely-controlled and evolving engineering of ontologies. In: European Conference on Artificial Intelligence: 2004; Valencia, Spain; 2004: 393-397.
22. Bada M, Stevens R, Goble C, Gil Y, Ashburner M, Blake J, Cherry J, Harris M, Lewis S: A short study on the success of the Gene Ontology. Journal of Web Semantics 2004, 1:235-240.
23. Cañas A, Leake DB, Wilson DC: Managing, Mapping and Manipulating Conceptual Knowledge. In: AAAI Workshop Technical Report WS-99-10: Exploring the Synergies of Knowledge Management & Case-Based Reasoning. Menlo Park, California: AAAI Press; 1999.
24. European Nutrigenomics Organisation [http://www.nugo.org]
25. Sansone SA, Rocca-Serra P, Tong W, Fostel J, Morrison N: A strategy capitalizing on synergies - the Reporting Structure for Biological Investigation (RSBI) working group. OMICS: A Journal of Integrative Biology 2006 (in press).
26. Whetzel P, Brinkman RR, Causton HC, Fan L, Fostel J, Fragoso G, Heiskanen M, Hernandez-Boussard T, Morrison N, Parkinson H, Rocca-Serra P, Sansone SA, Schober D, Smith B, Stevens R, Stoeckert C, Taylor C, White J, and members of the communities collaborating in the FuGO project: Development of FuGO: an Ontology for Functional Genomics Investigations. OMICS: A Journal of Integrative Biology 2006 (in press).
27. Whetzel P, Parkinson H, Causton HC, Fan L, Fostel J, Fragoso G, Game L, Heiskanen M, Morrison N, Rocca-Serra P, Sansone SA, Taylor C, White J, Stoeckert CJ Jr: The MGED Ontology: a resource for semantics-based description of microarray experiments. Bioinformatics 2006, 22(7):866-873.
28. Open Biomedical Ontologies [http://obo.sourceforge.net/]
29. Rubin D, Lewis S, Mungall C, Misra S, Westerfield M, Ashburner M, Sim I, Chute C, Solbrig H, Storey M, Smith B, Day-Richter J, Noy NF, Musen M: The National Center for Biomedical Ontology: Advancing Biomedicine through Structured Organization of Scientific Knowledge. OMICS: A Journal of Integrative Biology 2006 (in press).
30. Noy N, Fergerson R, Musen M: The knowledge model of Protege-2000: Combining interoperability and flexibility. In: 12th International Conference on Knowledge Engineering and Knowledge Management (EKAW'2000): 2000; Juan-les-Pins, France; 2000.
31. Briggs G, Shamma DA, Cañas AJ, Carff R, Scargle J, Novak JD: Concept Maps Applied to Mars Exploration Public Outreach. In: Proceedings of the First International Conference on Concept Mapping: 2004; Pamplona, Spain; 2004.
32. Leake D, Maguitman A, Reichherzer T, Cañas A, Carvalho M, Arguedas M, Brenes S, Eskridge T: Aiding Knowledge Capture by Searching for Extensions of Knowledge Models. In: Proceedings of K-CAP: 2003; Sanibel Island, Florida, USA; 2003.
33. WEBEX [http://www.webex.com/]
34. Hayes P, Saavedra R, Reichherzer T: A collaborative development environment for ontologies. In: Semantic Integration Workshop: 2003; Sanibel Island, Florida, USA; 2003.
35. Seipel D, Baumeister J: Declarative Methods for the Evaluation of Ontologies. Künstliche Intelligenz 2004:51-57.
36. Haarslev V, Möller R: Racer: A Core Inference Engine for the Semantic Web. In: Proceedings of the 2nd International Workshop on Evaluation of Ontology-based Tools (EON2003): October 20, 2003; Sanibel Island, Florida, USA; 2003: 27-36.
37. Tempich C, Pinto H, Sure Y, Staab S: An Argumentation Ontology for DIstributed, Loosely-controlled and evolvInG Engineering processes of oNTologies (DILIGENT). In: Second European Semantic Web Conference: 2005; Greece; 2005: 241-256.
38. GMS Ontology [http://cropwiki.irri.org/icis/index.php/Germplasm_Ontology]
39. Diaz A, Baldo G: Co-Protege: A Groupware Tool for Supporting Collaborative Ontology Design with Divergence. In: 8th International Protege Conference: 2005; Madrid, Spain; 2005: 32-32.
3.2 THE USE OF CONCEPT MAPS FOR TWO ONTOLOGY DEVELOPMENTS: NUTRIGENOMICS, AND A MANAGEMENT SYSTEM FOR GENEALOGIES.
Abstract. We briefly describe the methodology we have adopted in order to develop ontologies. Because our scenarios involved geographically distributed domain experts, the domain analysis and knowledge acquisition phases used different independent technologies that were not always integrated into the Protégé suite; groupware capabilities were thus achieved by combining separate tools. From these experiences we identify conceptual maps (CMs) as an important collaborative and knowledge acquisition tool for the development of ontologies. The direct manipulation and collaborative facilities that currently exist in Protégé can be improved with the lessons learnt from this and similar experiences. Here we describe our scenario, competency questions, results and milestones for each methodological stage, use of CMs, and vision for a collaborative environment for ontology development. This presentation is based on two different sets of experiences, one within nutrigenomics and the other within plant genealogy management systems.
3.2.1 Introduction
When developing an ontology involving geographically distributed domain experts, the domain analysis and knowledge acquisition phases may become a bottleneck due to difficulties in establishing a formal means of communication (i.e. in sharing knowledge). Conceptual maps (CMs) have been demonstrated to be an effective means of representing and communicating knowledge [1]. Traditionally, ontologies have been built by highly trained knowledge engineers with the assistance of domain specialists; it is a time-consuming and laborious task. Ontology tools are available to support this work, but their use requires training in knowledge representation and predicate logic [2]. Bio-ontologies are developed primarily by biologists. Domain experts are rarely available in one place, so the development of bio-ontologies is usually a distributed effort in which teleconferences, email, commentary-tracking systems, and videoconferences are used at all stages. During our ontology building efforts, we identified the lack of an integrated environment in which at least some of these technologies come together to facilitate both knowledge representation and sharing as a major bottleneck. CMs may help to overcome these issues.
Conceptual maps are graphs that consist of nodes, with connecting arcs that represent relationships between nodes [3]. The nodes are labeled with descriptive text representing the "concept", and the arcs are labeled (sometimes only implicitly) with a relationship type. We used CMs in two stages of our process: the capture of knowledge, and the testing of the structure of the ontology. We started with informal CMs; although these are not computationally enabled, for humans they appear to have greater "computational efficiency" than other forms of knowledge representation, e.g. EXCEL™ spreadsheets or Microsoft Word™ tables. As our models gained semantic richness, the CMs evolved and became more complex, formalising the knowledge in our ontologies. We found that the CMs made it possible for domain experts to identify and represent concepts, and to declare relations among them. More importantly, they helped clarify the difference between the ontological model, ER (entity-relationship) models and the possible object model (OM). For biologists, ontologies have a concrete representation in dictionaries, whereas they view object models as being more related to implementation. Implementation details were thus separated from ontologically related issues. We used CMAP (http://cmap.ihmc.us/) [1] as a CM editor (a minimal sketch of this node-and-arc representation follows at the end of this introduction).

The ontologies we are developing are asymmetric and complementary. With one we want to ease the process of accurately capturing nutrigenomics data via web-forms, whereas with the other we want to facilitate the building of queries over large genealogy databases (http://cropwiki.irri.org/icis/index.php/Germplasm_Ontology). They are two different experiences with similar problems and a common bottleneck: knowledge acquisition. From both ontologies we identified the importance of cognitive support over the groupware facility.

This paper is organised as follows. In Section 2 we present our methodology, and describe how we used CMs not only to capture knowledge, but also to share it in a distributed environment. Section 3 presents the development of a CM plug-in for Protégé. Brief discussions, conclusions, and an outline of our future work are presented in Section 4.
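As a concrete (if simplified) illustration of the node-and-arc representation described above, a CM can be flattened to a list of labeled arcs, each of which reads as a natural-language proposition. This is a minimal sketch only; the concept and relationship names below are invented for illustration and are not taken from either ontology:

# A concept map as labeled nodes and labeled arcs (invented example).
concepts = {"diet", "nutrient", "gene expression"}
arcs = [
    ("diet", "provides", "nutrient"),
    ("nutrient", "modulates", "gene expression"),
]

# Every arc must connect two declared concepts.
assert all(s in concepts and t in concepts for s, _, t in arcs)

def propositions(arcs):
    """Read a CM as natural-language propositions, one per labeled arc."""
    return [f"{source} {label} {target}" for source, label, target in arcs]

for p in propositions(arcs):
    print(p)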
3.2.2 Methodology
For our particular purposes we decided to adapt some previously reported methodologies in order to enable communication among domain experts and with the ontologist, to effectively reuse other ontologies, and to provide, to the extent possible, a high-level conceptual scaffold so that other ontologies could be integrated later. We extended the methodology proposed by Mirzaee et al. [4]. Figure 5 schematises the methodology we followed.

Chapter 3 - Figure 5. Methodology, milestones, and phases.

Domain analysis is a process in which information used in a particular domain is identified, captured, and organised for the purpose of making it reusable. We hosted a series of meetings during which domain experts agreed on terminology, and on how to structure the reporting of an investigation. We view domain analysis as an iterative process, taking place at every stage. We focused our discussions on specific descriptions of what the ontology should support, and sketched the intended area of application that the ontology was to capture. Our goal was also to guide an ontology engineer, and to involve him or her in a more direct manner;
so we also made decisions about inclusion and exclusion, and produced the first draft of the hierarchical structure of concepts in the ontology. An important outcome from this phase was the consensus that we reached on terms that could potentially have a meaning for our intended users. The main aim of these informal linguistic models was to build an explanatory dictionary; some basic relations between concepts were also established. We built different models throughout our analyses of available knowledge sources and information gathered in previous steps. First a "baseline ontology" was gathered, i.e. a draft version containing few but seminal elements of an ontology. Typically, the most important concepts and relations were identified somewhat informally. We could assimilate this "baseline ontology" to a taxonomy, in the sense of a structure of categories and classifications. We consider a taxonomy to be "a controlled vocabulary which is arranged in a concept hierarchy", and an ontology to be "a taxonomy where the meaning of each concept is defined by specifying properties, relations to other concepts, and axioms narrowing down the interpretation". As the process of domain analysis and knowledge acquisition evolves, the taxonomy takes the shape of an ontology. During this step the ontologist worked primarily with only a very few of the domain experts; the others were involved in weekly meetings. In this phase the ontologist sought to provide the means by which the domain experts he or she was working with could express their knowledge. Some deficiencies in the available technology were identified, and for the most part were overcome by our use of CMs. For subsequent steps (i.e. formalisation and evaluation), different needs may be identified.
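The distinction drawn above between a taxonomy and an ontology can be made concrete in OWL. The following is a minimal sketch using the rdflib Python library; the names Investigation, Subject and hasSubject are hypothetical and do not reproduce any of the ontologies discussed here:

from rdflib import Graph, Namespace, BNode, RDF, RDFS, OWL

NS = Namespace("http://example.org/rsbi#")  # hypothetical namespace
g = Graph()

# A taxonomy: a controlled vocabulary arranged in a concept hierarchy.
g.add((NS.Investigation, RDF.type, OWL.Class))
g.add((NS.ToxicogenomicsStudy, RDF.type, OWL.Class))
g.add((NS.ToxicogenomicsStudy, RDFS.subClassOf, NS.Investigation))

# An ontology narrows the interpretation with an axiom: here, every
# Investigation has at least one Subject (hypothetical property hasSubject).
g.add((NS.Subject, RDF.type, OWL.Class))
g.add((NS.hasSubject, RDF.type, OWL.ObjectProperty))
restriction = BNode()
g.add((restriction, RDF.type, OWL.Restriction))
g.add((restriction, OWL.onProperty, NS.hasSubject))
g.add((restriction, OWL.someValuesFrom, NS.Subject))
g.add((NS.Investigation, RDFS.subClassOf, restriction))

print(g.serialize(format="turtle"))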
3.2.3 CM plug-in for Protégé
Our knowledge acquisition phase took place in different stages, for some of which the domain experts were not together. CMs proved very useful in facilitating visualisation and discussion, and in providing domain experts with a tool that could be used to declare the primary elements of their knowledge. OWLviz [5] was initially tested to support domain
experts in this task, but this plug-in did not provide direct manipulation (DM) capabilities over the concepts and the relations among them. We also tested Jambalaya [6] before deciding to use two separate tools (i.e. Protégé [7] and the CMAP tools). Since CMs support the declaration of nodes and relationships, it was easy to assimilate these to classes and properties. The conversion was a straightforward, albeit manual, process (a sketch of such a conversion is given at the end of this section). The main feature we identified from our work with CMs was the DM capability provided by the software. This functionality had several advantages, which we list below. Interestingly, all of these advantages had previously been identified by Shneiderman:
• Novices can learn basic functionality quickly, usually through a demonstration by a more experienced user.
• Experts can work extremely rapidly to carry out a wide range of tasks, even defining new functions and features.
• Knowledgeable intermittent users can retain operational concepts.
• Error messages are rarely needed.
• Users can see immediately if their actions are furthering their goals; if not, they can simply change the direction of their activity.
• Users have reduced anxiety because the system is comprehensible and because actions are so easily reversible.
We are currently starting to develop the CM plug-in. Basically, it facilitates the declaration of properties and classes, writing them to the OWL file. Some of the formal requirements we have identified for our plug-in are:
• Graphic manipulation of classes and properties via contextual menus.
• Direct publication over the web of the CMs we generate.
• Drag-and-drop capabilities.
• Relationships between concepts and their concrete representations, and annotation features (e.g. text, colors, graphics, and even files).
• Manipulation of the same file by different users, with a mechanism to track changes.
• Availability of a chat window.
• Possibility for moderated or un-moderated sessions. This is particularly important for situations in which more than four people are working online on the same file.
• The user interface should be non-intrusive.
• The user should be presented with an empty canvas on which concepts, linking phrases and properties can be declared by a direct click.
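The manual CM-to-OWL conversion referred to above can be approximated in a few lines of code. This is a minimal sketch, not the plug-in itself: the concepts and linking phrases are invented, and the rdflib Python library is assumed:

from rdflib import Graph, Literal, Namespace, RDF, RDFS, OWL

NS = Namespace("http://example.org/cm#")  # hypothetical namespace

# A concept map flattened to (concept, linking phrase, concept) arcs.
cm_arcs = [
    ("BioSample", "is a kind of", "BioMaterial"),
    ("Subject", "provides", "BioSample"),
]

g = Graph()
for source, link, target in cm_arcs:
    g.add((NS[source], RDF.type, OWL.Class))
    g.add((NS[target], RDF.type, OWL.Class))
    if link == "is a kind of":
        # Hierarchical linking phrases become subclass axioms.
        g.add((NS[source], RDFS.subClassOf, NS[target]))
    else:
        # Any other linking phrase becomes an object property.
        prop = NS[link.replace(" ", "_")]
        g.add((prop, RDF.type, OWL.ObjectProperty))
        g.add((prop, RDFS.domain, NS[source]))
        g.add((prop, RDFS.range, NS[target]))
        g.add((prop, RDFS.label, Literal(link)))

print(g.serialize(format="turtle"))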
3.2.4 Conclusions and future work
Since our methodology involves participatory design activities, it is important for the tool to support this range of activities. We consider that CMs may play a crucial role in assisting users in these activities. Our development inherits many of the features already available in CmapTools; we are extending it so that we may also allow users to "discuss" on-line while at the same time manipulating the OWL file. We are thus extending the capabilities currently available in Protégé, not just to enhance browsing but, more deeply, to promote a collaborative environment for the development of ontologies. Since Protégé was mainly developed as a desktop tool, its web implementation lacks some groupware features. In order to implement an integrated web-based ontology development environment, Human-Computer Interaction studies need to be conducted.
3.2.5 Acknowledgements
The authors would like to thank Robert Stevens and Karim Nashar for the useful discussions and collaboration. A. Garcia is supported by the Institute for Molecular Bioscience, the Australian Centre for Plant Functional Genomics, the ARC Centre in Bioinformatics and the EMBL-EBI. SA Sansone is supported by the ILSI-HESI Genomics Committee and Philippe Rocca-Serra by the European Commission NuGO project.
3.2.6 References
1. Cañas AJ, et al: CmapTools: A knowledge modeling and sharing environment. In: Concept Maps: Theory, Methodology, Technology. Pamplona, Spain: Universidad Pública de Navarra; 2004.
2. Seongwook Y, et al: Survey about ontology development tools for ontology-based knowledge management. 2003.
3. Lambiotte JG, et al: Multi-relational semantic maps. Educational Psychology Review 1989, 1(4):331-367.
4. Mirzaee V, Iverson L, Hamidzadeh B: Towards ontological modelling of historical documents. In: The 16th International Conference on Software Engineering and Knowledge Engineering (SEKE): 2004.
5. Knublauch H: OWLviz: a visualisation plugin for the Protégé OWL plugin. http://www.co-ode.org/downloads/owlviz/.
6. Storey M-A, et al: Jambalaya: Interactive visualization to enhance ontology authoring and knowledge acquisition in Protégé. In: Workshop on Interactive Tools for Knowledge Capture, K-CAP-2001: 2001; Victoria, B.C., Canada.
7. Gennari JH, et al: The evolution of Protégé: An environment for knowledge-based systems development. International Journal of Human Computer Studies 2003, 58(1):89-123.
Cognitive support for an argumentative structure during the ontology development process
The importance of conceptual maps, as well as their use, was studied at length in the experiences reported in chapters 3 and 5. Although the benefits of concept maps were well understood, it was also clear that, in order to better facilitate communication amongst domain experts and with the knowledge engineer, it was important to have an argumentative structure. Interaction amongst domain experts generates large amounts of data and information, not always usable or understandable by the knowledge engineer; for this purpose conceptual maps could be used. This chapter addresses the problem of supporting the argumentative structure that results from the interaction amongst domain experts; it also studies the argumentative structure within the context of developing ontologies in decentralised settings. The main contribution of this paper is not only to present a practical use for argumentative structures, but also to support this structure by means of conceptual maps. In this chapter the use of concept maps is proposed as a means to support and scaffold an argumentative structure during the development of ontologies within loosely centralised communities. This novel use of conceptual maps had not been previously studied. The author conceived and coordinated the project. The proposed use of conceptual maps, as well as the extensions for the argumentative structure, was the product of the analysis the author carried out during the experiences reported in this thesis. Alex Garcia wrote the published paper based on this chapter.
AUTHORS' CONTRIBUTIONS
Alex Garcia Castro conceived and coordinated the project; he also wrote the manuscript for this paper. Angela Noreña and Andrés Betancourt were domain experts in the knowledge elicitation exercises and also assisted Alex Garcia Castro in the implementation of the first version of the plug-in. Mark A. Ragan supervised the project, and assisted Alex Garcia Castro in the preparation of the final manuscript.
PUBLISHED PAPER ARISING FROM THIS CHAPTER
Garcia Castro A: Cognitive support for an argumentative structure during the ontology development process. In: 9th Intl Protégé Conference: July 2006; Stanford, CA, USA; 2006.
4 Chapter IV - Cognitive support for an argumentative structure during the ontology development process
Abstract: Structuring and supporting the argumentative process that takes place within knowledge elicitation is a major problem when developing ontologies. Knowledge elicitation relies heavily on the argumentative process amongst domain experts. The involvement of geographically distributed domain experts, and the need for domain experts to lead the design process, add an interesting layer of complexity to the whole process. We consider that the argumentative structure should facilitate the elicitation process and serve as documentation for the whole process; it should also facilitate the evolution and contextualisation of the ontology. We propose the use of concept maps as a means to support and scaffold an argumentative structure during the development of ontologies within loosely centralised communities.
4.1 INTRODUCTION
The applications of knowledge engineering are growing larger and more systematic, now encompassing more ambitious ontologies; sizes in the hundreds of thousands of concepts will not be uncommon [1]. Furthermore, the development of those ontologies is usually a participatory exercise in which different experts interact via virtual means, thereby resembling a loosely centralised community. We believe the requirements of the Semantic Web (SW) bring with them an associated need for enhanced cognitive support in the tools we use. Cognitive support leverages innate human abilities, such as visual information processing, to increase human understanding and cognition of challenging problems [2]. Developing ontologies in loosely centralised environments such as those described by Pinto et al. [3] poses challenges not previously considered by most existing methodologies. This user-centric design relies heavily on the ability of domain experts to interact with each other and with the knowledge engineer; by doing so the ontology evolves. Mailing lists, web forums, and WIKI pages usually support this interaction. Despite this combination of tools (none of them an ontology editor per se, nor a knowledge engineering tool), information is lost, documentation is poorly structured, and the process is not always easy to follow. This results in decreased participation by the domain experts.
One of the key components in the development of ontologies in loosely centralised environments is the discussion related to each and every term and relationship/property. Pinto et al., as well as Tempich et al. [3, 4], have proposed an argumentative structure to support and facilitate the discussion within the process of developing ontologies in loosely centralised environments. Both Garcia et al. and Hayes et al. [5, 6] have studied the use of CMs during the elicitation process when developing ontologies in distributed environments. However, it is not clear how to support the proposed structure, nor what the role of the argumentative process is within the development of the ontology. The knowledge elicitation process, part of the whole ontology development, is a major bottleneck, particularly within those communities in which domain experts are geographically distributed. In order to assist the elicitation process and improve the interaction, we propose the use of CMs as a means to scaffold the argumentative structure. This paper is organised as follows. Firstly we provide some background information and present our approach to the problem of supporting argumentative structures. In Section 2 we explain what an argumentative structure is within the context of ontology development; we also present in this section the relationship between a CM and the argumentative structure proposed by Tempich et al. [4]. In Section 3 we present our CM plug-in for Protégé and elaborate further on how our plug-in supports, assists and facilitates the argumentative process. We present a brief discussion and conclusions in Section 4.
4.2 ARGUMENTATIVE STRUCTURE AND CMS
Central to ontology development is the process by which domain experts and the knowledge engineer argue about terms/types and relationships. This collaborative interaction generates threads of arguments [3, 4, 7], and there is a need to support the evolution and maintenance of this argumentative process in a way that makes it easy to follow and, more importantly, that links to evidence and provides room for conflicting points of view. Figure 1 presents the argumentative structure proposed by [4].
Chapter 4 - Figure 1. The major concepts of the argumentation ontology and their relations. Reproduced with permission from [4]
CMs are semantically valid artefacts without OWL constraints; concepts and relationships are the main scaffold of a CM. At any given point during the argumentative process one has a concept/class and a relationship/property. The evolution of the discussions increases the amount of information attached to the concept or relationship; the argumentative structure is enriched as domain experts provide arguments and base them upon evidence, which may be a paper, a commentary, or more generally a file of any kind (i.e. an information source). The different views of the world can be represented with a CM, and the evidence may be attached to the particular concept/class or relationship/property at hand. This graphic representation facilitates the continuous exchange of information amongst
domain experts, i.e. the sharing of knowledge. Following the threads of the discussions is not always easy for domain experts. The information exchanged is usually structured as an email-based chat. The knowledge engineer has to follow these text-based discussions, in which there is mostly verbal knowledge, filter them, and at some point "formalise" that implicit knowledge. Moving from verbal knowledge to formalised, shared knowledge is difficult; some information is usually lost, the evidence supporting the different positions is not always provided by domain experts, and, most importantly, keeping domain experts engaged throughout the entire process is not always possible. Cognitive support is thus required so that we may facilitate the useful flow/exchange of information and at the same time record the entire process.
4.3 ARGUMENTATION VIA CMS
Concepts and relationships resemble the two key components of an argumentative structure: arguments and positions. During the development process we argue in relation to a concept and/or a relationship. Positions are supported by evidence, and the simple argumentative structure is by itself a particular view of the world that is being modelled. Figure 2 illustrates the basics behind the relationship between CMs and an argumentative structure.
Chapter 4 - Figure 2. A simplification of the argumentative structure presented by Tempich et al. The pizza example (http://www.co-ode.org) is used in order to illustrate our simplified argumentative structure.

For any given issue there is an argument that is elaborated by presenting the conflicting positions. The elaboration provides instances, i.e. concrete examples. For any issue
there is a concertation² process that presents argument-elaborated conflicting positions. Once a consensus is reached there is a position on the issue initially at hand. The issue is well focused and specific; the same is true for the argument, which supports a position in simple and few words, whereas the elaboration of the argument tends to be larger, and supported by different files (e.g. pdf, ppt, doc, xls). Although there may be more than one argument for any given issue, there is only one elaboration for each argument. The dispute resolution process (also known as the conciliatory process) produces a position on the particular issue; within this process the knowledge engineer acts as a facilitator. Discussions over terminology, and over conceptual models, tend to address one issue at a time; this is highly dependent on the knowledge engineer. As the ontology grows, so does the complexity of the information available for each and every component of the ontology (e.g. classes, properties, instances). Although having an ontology that represents the structure of the argumentative process helps the knowledge engineer in the classification of the information, for the evidentiary material to be useful it needs to be attached to the relevant piece of the ontology. For instance, when discussing "biomaterial" within the development of a laboratory information management system for functional plant genomics, one feasible starting point for the discussion would be to adopt the same understanding of biomaterial as is available in the microarray ontology.
² Concertation. From the French concertation: a conciliatory process by which two parties reach an agreement.
class BioMaterial
  definition: Description of the processing state of the biomaterial for use in the microarray hybridisation.
  superclasses: BioMaterialPackage
  known subclasses: BioSample, BioSource, LabeledExtract
  properties:
    unique_identifier: MO_226
    class_role: abstract
    class_source: mage
  constraints:
    restriction: has_type has-class MaterialType
    restriction: has_biomaterial_characteristics has-class BioMaterialCharacteristics

Chapter 4 - Figure 3. Biomaterial from MGED, as defined at http://mged.sourceforge.net/ontologies/MGEDontology.php#BioMaterial
When discussing this term, domain experts considered there was a need to have a more general meaning. Domain experts not only proposed different meanings but also identified properties and instances, at the same time providing the knowledge engineer with competency questions and relevant scenarios in which they considered the term would be used. Furthermore, domain experts discussed the relationship between biomaterial and biosample. Conceptual maps proved to be very useful for gathering all this information in a structured and usable manner. Not only were domain experts able to follow the argumentative structure without even being aware of it, but the process was also being documented in a way that made it both easy for domain experts to exchange information and easy for the knowledge engineer to assist domain experts in the process. A very important part of the whole process is the management of the history: tracing the argumentation process back from the position_on_issue to the elaboration for a particular argument, and being able to "see" the argumentative structure in order to "stand" in a particular place. The history should also allow us to go back and take an alternative route; we thus see the history not just as a simple "undo" but as a more complex feature.
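The simplified argumentative structure sketched above (issue, arguments, one elaboration per argument, attached evidence, and a final position) can be captured in a small data model. The following is a minimal sketch in Python, not part of the plug-in described here; the "biomaterial" record is hypothetical:

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Argument:
    text: str          # short statement supporting a position
    elaboration: str   # exactly one elaboration per argument
    evidence: List[str] = field(default_factory=list)  # attached files (pdf, ppt, ...)

@dataclass
class Issue:
    topic: str                       # concept or relationship under discussion
    arguments: List[Argument] = field(default_factory=list)
    position: Optional[str] = None   # filled once consensus is reached

    def history(self) -> List[str]:
        """Trace the argumentation back from the position to each elaboration."""
        trail = [f"position_on_issue: {self.position or 'open'}"]
        for a in self.arguments:
            trail.append(f"argument: {a.text} -> elaboration: {a.elaboration}")
        return trail

# Hypothetical record of the 'biomaterial' discussion described above.
issue = Issue(topic="biomaterial is-a biosample?")
issue.arguments.append(Argument(
    text="Adopt the MGED meaning",
    elaboration="Reuse the MO_226 definition so microarray data stay interoperable",
    evidence=["MGEDontology.php#BioMaterial"]))
issue.position = "Generalise biomaterial; keep the MGED definition as a subclass"
print("\n".join(issue.history()))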
4.4 DISCUSSION AND CONCLUSIONS
As described in Section 4.3, for any given issue arguments are elaborated by presenting the conflicting positions, elaborations provide concrete examples, and a concertation process leads to a consensus position on the issue initially at hand. Issues and arguments are well focused and specific; elaborations tend to be larger and are supported by different files (e.g. pdf, ppt, doc, xls), and although there may be more than one argument for any given issue, there is only one elaboration for each argument. The dispute-resolution process produces a position on the particular issue, with the knowledge engineer acting as a facilitator, and discussions over terminology and conceptual models tend to address one issue at a time. Within this context, conceptual maps provided a scaffold upon which the argumentative ontology may be instantiated.
4.5 REFERENCES
1. Ernst NA, Storey M-A, Allen P: Cognitive support for ontology modeling. Int J Human-Computer Studies 2005, 62:553-577.
2. Walenstein A: Cognitive support in software engineering tools: a distributed cognition framework. Simon Fraser University; 2002.
3. Pinto HS, Staab S, Tempich C: Diligent: towards a fine-grained methodology for Distributed, Loosely-controlled and evolving engineering of ontologies. In: European Conference on Artificial Intelligence: 2004; Valencia, Spain; 2004: 393-397.
4. Tempich C, Pinto HS, Sure Y, Staab S: An Argumentation Ontology for DIstributed, Loosely-controlled and evolvInG Engineering processes of oNTologies (DILIGENT). In: Second European Semantic Web Conference: May 2005; Greece; 2005: 241-256.
5. Garcia Castro A, Rocca-Serra P, Stevens R, Taylor C, Nashar K, Ragan MA, Sansone S: The use of concept maps during knowledge elicitation in ontology development processes - the nutrigenomics use case. BMC Bioinformatics 2006, 7:267.
6. Hayes P, Eskridge CT, Saavedra R, Reichherzer T, Mehrotra M, Bobrovnikoff D: Collaborative Knowledge Capture in Ontologies. In: K-CAP 05: 2005; Banff, Canada; 2005.
7. Garcia Castro A, Sansone AS, Rocca-Serra P, Taylor C, Ragan MA: The use of conceptual maps for two ontology developments: nutrigenomics, and a management system for genealogies. In: 8th Intl Protégé Conference: 2005; Madrid, Spain; 2005: 59-62.
Narratives and biological investigations
This chapter has two sections. The first, "The use of narratives and text-mining extraction techniques to support the knowledge elicitation process during two ontology developments", presents a study of the narratives that were gathered during the knowledge elicitation process. Conceptual maps could be used to support the argumentative structure; they were also quite useful when eliciting knowledge. However, eliciting knowledge was not always a straightforward question-and-answer process. Very often domain experts built narratives in order to explain their scenarios to the knowledge engineer in a more illustrative manner. Moreover, it was observed that once the baseline ontology was built, domain experts tended to support their discussions with these narratives. Empirically, they were using conceptual maps as they were "drawing" their ideas. Although some instances were gathered from these narratives, it was important to better frame the elicitation exercises. How could the narratives and the elicitation exercises be better framed, and how could these narratives be better used and supported when eliciting knowledge? These are the two main issues this section addresses. The second section of this chapter, "A proposed semantic framework for reporting OMICS investigations", addresses the issue of describing biological investigations: "How to provide semantics for upper-level elements relevant to the representation and interpretation of omics-based investigations?" This section presents an upper-level ontology for the representation of biological investigations. The experience reported here was useful, as it was important for the author to test the proposed methodology with domain experts from different disciplines (Nutrigenomics, Toxicogenomics, Environmental Genomics); it was equally important to study how these domain experts reached their consensuses after debating their conceptual models. Chapter 7 follows up on the issue of describing biological investigations, not from the semantic perspective but by studying practical issues when describing these
investigations. The ontology described in this section is the product of collaboration between the author and domain experts from the MGED-RSBI working group. Alex Garcia conceived and coordinated the work presented in this chapter. He identified the need to make better use of the narratives as they were being gathered. Alex Garcia also investigated how to use text-mining techniques in order to support the development of ontologies. The author conducted several meetings with members of the MGED-RSBI working group in order to develop the presented ontology.
AUTHORS' CONTRIBUTIONS
Alex Garcia Castro conceived and coordinated both projects; he also wrote the manuscripts for this paper. Susanna Sansone provided useful discussion, and assisted Alex Garcia in the preparation of the submitted manuscripts. Philippe Rocca-Serra and Chris Taylor provided useful discussion.
PUBLISHED PAPER ARISING FROM THIS CHAPTER
Garcia Castro A, Sansone AS, Taylor CF, Rocca-Serra P: A conceptual framework for describing biological investigations. In: NETTAB: 2005; Naples, Italy; 2005.
5 Chapter V - Narratives and biological investigations

5.1 THE USE OF CONCEPT MAPS AND AUTOMATIC TERMINOLOGY EXTRACTION DURING THE DEVELOPMENT OF A DOMAIN ONTOLOGY. LESSONS LEARNT.
Abstract. Extracting terminology is not always an integral part of methodologies for building ontologies. Moreover, the use of terms extracted from literature relevant to the domain of knowledge for which the ontology is being built has not been extensively studied within the context of knowledge elicitation. We present here some extensions to the methodology proposed by Garcia et al. (BMC Bioinformatics 7:267, 2006); two important advances over the initially proposed methodology are the use of extracted terminology for framing the building of concept maps, and the use of narratives during the knowledge elicitation phases.
5.1.1 Introduction
At a glance, an ontology represents some kind of world view with a set of concepts and relations amongst them, all of these defined with respect to the domain of interest. Some scholars redefine the term in an effort to capture an absolute view of the world. For instance, Sowa [1] defines ontology as "the study of existence, of all kinds of things (abstract and concrete) that make up the world". A more pragmatic definition is given by Neches et al. [2], who consider that an ontology "defines the basic terms and relations comprising the vocabulary of a topic area, as well as the rules for combining terms and relations to define extensions to the vocabulary". For practical reasons we agree with this definition, as the main aim of the GMS (Genealogy Management System) ontology is to define a set of basic terms that may accurately describe germplasm within the context of crop information systems, more specifically within the International Crop Information System (ICIS) [3]. In this paper we present our early ontology for the GMS as well as the methodology we followed. Our scenario involved the development of an ontology with direct physical access to domain experts within the Australian Centre for Plant Functional Genomics (ACPFG) and the International Center for Tropical Agriculture (CIAT). We thus decided to adapt and
extend parts of different previously proposed methodologies for ontology development. We adapted the methodology reported by [4] (henceforth GM) by proposing an alternative use for concept maps (CMs) [5] from the one described by [6]. We also investigated some semi-automated techniques for extracting terms from texts, as well as some other aspects of ontology development that were not clearly illustrated in the GM methodology. This paper is organised as follows: Section 5.1.1 presents an introduction and some background information along with a brief description of our scenario. A survey of the methodologies we investigated is given in Section 5.1.2. Section 5.1.3 presents the extensions to the methodology we used; descriptions of the steps we took are also given in this section. We place special emphasis on text extraction and conceptual mapping during the elicitation process. Results (i.e. our ontology) are presented in Section 5.1.5. Our discussions and conclusions are presented in Section 5.1.6.
5.1.2 Survey of methodologies
A range of methods and techniques has been reported in the literature regarding ontology building methodologies. However, there is an ongoing argument amongst those in the ontology community about the best method to build them [7, 9]. Most of the ontology building methodologies are inspired by the work done in the field of knowledge-based engineering to create methodologies for developing Knowledge-Based Systems (KBS). For instance, the Enterprise Methodology [10], like most KBS development methodologies, distinguishes between the informal and formal phases of ontology development. METHONTOLOGY [11] adapts the work done in the area of knowledge-based evaluation for the ontology evaluation phase. The "Distributed, Loosely-controlled and evolving engineering of ontologies" (DILIGENT) methodology [12] offers a set of considerations and steps suitable for loosely centralised environments where domain experts are geographically distributed. Table 1 presents a summary of our comparison. GM (the Graph-based Methodology [4]) provided us with some detail for the knowledge elicitation process; however, our scenario was not entirely one in which domain experts were
geographically distributed, and thus some of the techniques described by GM could not be directly applied to our case. We analyzed these approaches according to the criteria proposed by Mirzaee [13]. Most of the methodologies do not provide details as to how one actually goes about building the ontology. Although GM had reported the use of concept maps as well as details for knowledge elicitation, it is not entirely clear how, within the process of eliciting knowledge, the narrative provided by different but complementary CMs may be reached or used. Nor is there any illustration of the relationship between CMs and terminology extraction; empirically we could see how these two techniques complement each other.
Description of stages. Enterprise Methodology: high-level description of stages. TOVE Methodology: high-level description of stages. Unified Methodology: high-level description. Methontology: stages are described for the chemical ontology. Diligent: detail is provided for those ontologies developed with this methodology. GM: high-level description as well as detailed information for each step.
Terminology extraction. N/A for all six methodologies.
Generality. Not domain specific, for all six methodologies.
Ontology evaluation. Enterprise Methodology: competency questions. TOVE Methodology: competency questions and formal axioms. Unified Methodology: no evaluation method is provided. Methontology: an informal evaluation method is used for the chemical ontology. Diligent: the community evaluates the ontology (agreement process). GM: no evaluation method is provided.
Distributed / decentralised. Enterprise Methodology: no. TOVE Methodology: no. Unified Methodology: no. Methontology: no. Diligent: yes. GM: yes.
Usability. Enterprise Methodology: N/A. TOVE Methodology: business and foundational ontologies. Unified Methodology: N/A. Methontology: chemical ontology. Diligent: N/A. GM: N/A.
Supporting software. Enterprise Methodology: N/A. TOVE Methodology: N/A. Unified Methodology: N/A. Methontology: WebODE. Diligent: N/A. GM: Protégé, CMap tools.

Chapter 5 - Table 1. Comparison of methodologies. Adapted from [4]
There is a gap between the existing software and the available methodologies. Although WebODE [14] was designed to support a particular methodology, it is general enough to
support other methods. We studied Protégé [15], HOZO [16], and pOWL [17] as software tools for developing ontologies. None of them supports a particular methodology in any special way. Moreover, none of these software packages provides support for terminology extraction or conceptual mapping. All of these methods and techniques are still determined to some extent by the particular circumstances in which they are applied. We must note that in any given circumstance there might be no available guideline for deciding on which techniques and methods to apply [18].
5.1.3 General view of our methodology
Since none of the reported methodologies could be fully applied to our particular scenario and/or needs, we decided to adapt and reuse some steps described in the investigated methodologies. The modifications we introduced to the methodology proposed by Garcia et al. were mostly due to the close relationship that our domain experts had with the implemented software, ICIS. This familiarity brought in some situations not fully addressed by Garcia et al., such as:
• Confusion between database schemata and ontology: domain experts were not fully aware of the difference between the conceptual and the relational model.
• Difficulties with the extracted terms.
• Domain experts were at the same time users, designers, developers, and policy makers of a particular kind of GMS; their vision was too broad on the process but at the same time too narrow on the software.
Since most of the steps we took have been described by Garcia et al., we will only present details for the variations we introduced. A schematic representation of our process is given in Figure 1.
Chapter 5 - Figure 1. A schematic representation of our process, extending GM.
We carefully followed, wherever possible, the GM methodology; our competency questions were formulated in natural language by domain experts, who were also leading the process and working closely with the knowledge engineer. As it is equally important when building ontologies to gather not only classes but also instances, we decided to investigate how we could better support our process by means of terminology extraction. Initially we only wanted to have classes and instances within our ontological corpus; however, terminology extraction also proved to be useful during knowledge elicitation, more specifically when combined with conceptual mapping. We used Text2Onto [19] as our terminology extraction tool because it allowed us to use documents in their original formats (PDF, XLS, DOC, TXT, etc.) as the main source of information. Text2Onto also facilitated the process of constraining the terminology by
allowing the domain experts and the knowledge engineer to inspect the models inferred from the extracted terminology. In parallel to our terminology extraction exercises, our domain experts were building informal ontology models. By informal we mean basic representations of their particular view of the world with no logical constraints; a "free-drawing" exercise that helped to engage communication between the knowledge engineer and the domain experts. Text2Onto is a text-mining tool that produces a Probabilistic Ontology Model (POM) representing the results of the system, with an assigned probability for each structure mined. Text2Onto is built on top of an ontology management infrastructure named KAON [20, 21]. Text2Onto captures instances and relationships from text; by doing this, the ontological structure grows. As Text2Onto mines different structures, it attaches a value that represents how certain an algorithm is about an instance of a modeling primitive, and for each modeling primitive there are various algorithms that can calculate this probability. We were not using Text2Onto's full range of capabilities, as we were not using KAON to build the ontology; we were only interested in using the extracted terms in order to frame the elicitation of the concept maps. Appendix 1 presents the extracted terms. The terms were extracted using the TermExtractor component of the TextToOnto ontology-engineering workbench. The TermExtractor uses the C-value method to identify and to estimate confidence in candidate multi-word terms in a corpus [22]. It utilises linguistic methods to identify the candidate terms and then uses statistical methods to provide each term with a "C-value" indicating confidence in its "termhood". This C-value is derived using a combination of "the total frequency of occurrence of the candidate string in the corpus, the frequency of the candidate string as part of other longer candidate terms, the number of these longer candidate terms, and the length of the candidate string (in number of words)" [22]. For additional details about the algorithm see [22] and for the implementation see [20].
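To make the scoring concrete, the following is a minimal sketch in Python of the C-value computation as characterised in [22]; it is not the TermExtractor implementation, and the candidate terms and counts are invented for illustration:

import math
from collections import defaultdict

def c_values(freq):
    """C-values for candidate multi-word terms.

    freq maps a candidate term (a tuple of words) to its corpus frequency.
    A term never nested in a longer candidate scores log2(len(a)) * f(a);
    otherwise the mean frequency of the longer candidates containing it is
    subtracted first, following the description in [22].
    """
    nested_freq = defaultdict(int)   # summed frequency of longer terms containing a
    nested_count = defaultdict(int)  # number of such longer terms
    terms = list(freq)
    for longer in terms:
        for shorter in terms:
            n, m = len(longer), len(shorter)
            if m < n and any(longer[i:i + m] == shorter for i in range(n - m + 1)):
                nested_freq[shorter] += freq[longer]
                nested_count[shorter] += 1
    scores = {}
    for a in terms:
        weight = math.log2(len(a))   # longer candidate strings weigh more
        if nested_count[a] == 0:
            scores[a] = weight * freq[a]
        else:
            scores[a] = weight * (freq[a] - nested_freq[a] / nested_count[a])
    return scores

# Invented counts for three germplasm-related candidate terms:
counts = {
    ("germplasm", "bank"): 12,
    ("seed", "packet"): 9,
    ("germplasm", "bank", "accession"): 5,
}
for term, score in sorted(c_values(counts).items(), key=lambda kv: -kv[1]):
    print(" ".join(term), round(score, 2))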
5.1.4 Our scenario and development process
The main goal of the GMS ontology is to describe the breeding history of germplasm; phenotypic and genotypic aspects of the germplasm are not to be considered by this ontology. The function of the GMS in ICIS is to provide a unique identifier for all packets of seed for a given germplasm. It should be noted here that although almost all progenitors of present germplasm no longer exist, we should know them in order to trace pedigrees. The GMS also manages all the names attached to the packet of seed: homonyms, synonyms, and abbreviations. Most importantly, the GMS provides a breeding history for the germplasm so that questions such as those listed below may be easily answered (a sketch of how one such question can be posed against the ontology is given at the end of this subsection).
• Does the germplasm belong to an out-breeding, in-breeding or vegetatively reproduced species?
• Is the germplasm homozygous or heterozygous?
• Is the germplasm homogeneous or heterogeneous?
• What type of cultivar (fixed lines, hybrid, clone, etc.) is formed?
• How has the germplasm been stored?
• Where did this germplasm come from (e.g. how did I get it)?
• What are its parents, grandparents, ancestors, descendants, and relatives?
• What probability do they have of having genes in common?
• What proportion of genes is expected to come from a list of ancestors?
• What parents do they have in common?
• Given an allele of a gene, from which ancestor did it come?
Our domain experts in these initial phases were not geographically distributed; we could gather them in one place so that they, along with the knowledge engineer, could build an informal ontology model. We will consider the subsequent involvement of geographically distributed domain experts in a loosely centralised environment at a later stage. We have generated a baseline ontology; different domain experts participated in the iterative building of informal ontology models. Interaction was supported by email, web-based ontology browsers, and direct interviews. In order to better support the future interaction amongst domain experts, we consider there is a need for collaborative ontology environments that promote collaboration from the cognitive perspective, as opposed to simple file-sharing systems.
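As a minimal sketch of how a competency question such as "What are its parents, grandparents and ancestors?" could be answered once the ontology is populated: the query below uses the rdflib Python library against the OWL file published in Section 5.1.5, but the germplasm name "IR64" and the identifiers gms:name and gms:hasParent are hypothetical; the real GMS ontology may use different names, and the URL may no longer resolve:

from rdflib import Graph

g = Graph()
# OWL file as published in Section 5.1.5 (availability not guaranteed).
g.parse("http://cropwiki.irri.org/icis/genealogy_07_01_05_a.owl", format="xml")

# "What are its parents, grandparents, ancestors?" as a transitive query
# over a hypothetical gms:hasParent property (SPARQL 1.1 property path).
query = """
PREFIX gms: <http://cropwiki.irri.org/icis/gms#>
SELECT DISTINCT ?ancestor WHERE {
    ?germplasm gms:name "IR64" .
    ?germplasm gms:hasParent+ ?ancestor .
}
"""
for row in g.query(query):
    print(row.ancestor)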
5.1.5 Results: GMS baseline ontology

By extracting terminology we could gather, without distinction, instances and possible classes; the knowledge engineer, together with the domain experts, then analyzed all this information. Some instances gathered in this way are illustrated in Figure 2. Classes, relationships and instances are part of this conceptual map, which was generated by our domain experts after comparing the extracted terminology with some of the initial ontological models. By using the informal models previously built, along with the extracted terms, we could reorganise our conceptual structure. This task resembled in many ways the card-sorting technique [23], but also a story-telling participatory exercise. Once the re-shuffling was complete, and the narratives analyzed, our baseline ontology (i.e. one containing only the seminal elements of an ontology) was ready, along with a set of instances.
Chapter 5 - Figure 2. Classes, instances, and relationships gathered by bringing together extracted terms and previously built ontological models.
The result of the elicitation stage within the functional plant genomics context is illustrated in Figure 3. We gathered ten different, yet related, concept maps from two domain
experts; this graphic represents the consensus. The main aim of the process modeled here is to improve the corresponding plant material; traditionally this improvement has dealt with specific phenotypic features such as yield, abiotic and biotic stresses, nutritional quality and market preferences. From the elicitation sessions we could identify several orthogonal ontologies needed in order to represent the processes that form part of the narrative we were working with. For instance, ontologies to describe "stress" and "plant yield" were needed to complement the model. In order to assist the knowledge engineer in the harmonisation of the concept maps gathered, domain experts were asked to tell a unified story that could bring together those different concept maps. As a guide, the domain experts had access to the list of extracted terminology. Interestingly, the story had a direct relationship with the main aim of the laboratory process; some of the GMS ontology terms were used, but the narrative was not limited to genealogies. A broader picture could thus be produced.
Chapter 5 - Figure 3. Narrative, as seen from the concept maps and ontology models the domain experts were building.
Our baseline ontology has classes, instances, and relationships; initially domain experts organised the classes with no consideration for time and space. For them it was important to have a coherent is-a structure they could relate to and consequently use in order to describe the genealogy of a given germplasm. Figure 4 illustrates the structure of our baseline ontology.
Chapter 5 - Figure 4. Baseline ontology.
Our baseline ontology has 59 classes and 10 properties; we have not yet started to populate the ontology with instances, as domain experts are still gathering individuals from the relevant ICIS databases. The corresponding OWL and PPRJ files for this ontology are available at http://cropwiki.irri.org/icis/genealogy_07_01_05_a.owl; Appendixes 3 and 4 present the OWL (Ontology Web Language) files containing two versions of this ontology.
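These counts can be checked programmatically. The following minimal sketch uses the rdflib Python library, which was not a tool used in this work, and assumes the OWL file at the URL above is still retrievable:

```python
# Minimal sketch (assumes the OWL file above is still retrievable): load
# the published ontology with rdflib and count its classes and properties.
from rdflib import Graph
from rdflib.namespace import OWL, RDF

g = Graph()
g.parse("http://cropwiki.irri.org/icis/genealogy_07_01_05_a.owl", format="xml")

classes = set(g.subjects(RDF.type, OWL.Class))
properties = set(g.subjects(RDF.type, OWL.ObjectProperty)) | \
             set(g.subjects(RDF.type, OWL.DatatypeProperty))

print(len(classes), "classes;", len(properties), "properties")  # expect 59 and 10
```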
5.1.6 Discussion and conclusions

By showing one feasible use of text mining when building ontologies, we not only extended the methodology proposed by Garcia et al. [19], but also developed a deeper understanding of how concept maps and text mining can be used together to build narratives that can later be used in the construction of ontologies. These narratives were used not only (by us) to ease the understanding of this particular domain but also, at a later stage, by the knowledge engineer to assess the ontological corpus gathered in the different models provided by domain experts. Domain experts were asked to match some of the provided narratives against the concept maps. By doing this exercise it was possible not only to extend our lexicon but also to evaluate the informal models. Engaging domain experts in the process of building the
ontology was also simplified by the use of the narratives. Domain experts were telling a story in a structured manner, and this allowed them to better understand the is-a relationship within the class hierarchy. In our experimental method for building an ontology to describe genealogies in the context of plant breeding, we constructed several ontological models by combining terms and relationships from the mined texts. Orthogonal ontologies were easily identified as domain experts represented their narratives as CMs. For instance, developmental stages described in the Plant Ontology were present in some of the CMs, as were anatomical parts of the plant. This helped us to see more clearly how to better describe germplasm within the context of an information system tightly coupled to a Laboratory Information Management System (LIMS). At the time of writing this chapter, our approach was also being used by the International Center for Tropical Agriculture (CIAT) as part of the methodology for building their LIMS, paying particular attention to the identification of the orthogonal ontologies needed by that system.

An important feature within narratives is the use of more than one elemental vocabulary to describe complex terms. The result is the creation of a relationship between the combinatorial vocabulary and each of the vocabularies used in its construction. The rationale behind this approach is that a plant's anatomical vocabulary should completely describe the anatomy of the plant, and a developmental-process vocabulary should completely describe all of the general biological processes involved in development. Therefore, we should be able to combine concepts from the two vocabularies to describe all of the processes involved in the development of all of the anatomical parts of the plant. These structures are represented in CMs as well as in the baseline ontologies gathered. Initially those models contained a myriad of relationships; as the process evolved and the hierarchy became better structured, the "whole/part-of" relationships between structures and substructures were better defined. In this, the narratives proved very useful.
From this experience we could also identify the gap between two ontological models built with two different software packages, Protégé and KAON. As KAON serves as the "platform" on top of which Text2Onto runs, the models it produces are not readable by Protégé. It was not possible for us to exploit all the functionalities of KAON, due mostly to incompatibility problems between Protégé and KAON.

5.1.7 References
1. Sowa JF: Knowledge Representation: Logical, Philosophical, and Computational Foundations. Pacific Grove, CA: Brooks Cole Publishing Co; 2000.
2. Neches R, Fikes R, Finin T, Gruber T, Patil R, Senator T, Swartout WR: Enabling Technology for Knowledge Sharing. AI Magazine 1991, 12(3):36-56.
3. International Crop Information System [http://icis.cgiar.org:8080]
4. Garcia Castro A, Rocca-Serra P, Stevens R, Taylor C, Nashar K, Ragan MA, Sansone S: The use of concept maps during knowledge elicitation in ontology development processes - the nutrigenomics use case. BMC Bioinformatics 2006, 7:267.
5. Cañas AJ, Hill G, Carff R, Suri N, Lott J, Eskridge T, Gómez G, Arroyo M, Carvajal R: CmapTools: A Knowledge Modeling and Sharing Environment. In: Proceedings of the First International Conference on Concept Mapping: 2004; Pamplona, Spain; 2004.
6. Garcia Castro A, Sansone AS, Rocca-Serra P, Taylor C, Ragan MA: The use of conceptual maps for two ontology developments: nutrigenomics, and a management system for genealogies. In: 8th International Protégé Conference: 2005; Madrid, Spain; 2005: 59-62.
7. Noy NF, Hafner CD: The state of the art in ontology design - A survey and comparative review. AI Magazine 1997, 18(3):53-74.
8. Lopez MF, Perez AG: Overview and Analysis of Methodologies for Building Ontologies. Knowledge Engineering Review 2002, 17(2):129-156.
9. Beck H, Pinto HS: Overview of Approach, Methodologies, Standards, and Tools for Ontologies. The Agricultural Ontology Service (UN FAO); 2003.
10. Uschold M, King M: Towards a Methodology for Building Ontologies. In: Workshop on Basic Ontological Issues in Knowledge Sharing, held in conjunction with IJCAI-95: 1995; Cambridge, UK; 1995.
11. Fernández M, Gómez-Pérez A, Juristo N: METHONTOLOGY: From Ontological Art Towards Ontological Engineering. In: Workshop on Ontological Engineering, AAAI-97 Spring Symposium Series: 1997; Stanford; 1997.
12. Pinto HS, Staab S, Tempich C: DILIGENT: towards a fine-grained methodology for Distributed, Loosely-controlled and evolving engineering of ontologies. In: European Conference on Artificial Intelligence: 2004; Valencia, Spain; 2004: 393-397.
13. Mirzaee V: An Ontological Approach to Representing Historical Knowledge. PhD Thesis. Vancouver: Department of Electrical and Computer Engineering, University of British Columbia; 2004.
14. Arpírez JC, Corcho O, Fernández-López M, Gómez-Pérez A: WebODE in a nutshell. AI Magazine 2003, 24(3):37-47.
15. Noy NF, Fergerson RW, Musen MA: The knowledge model of Protégé-2000: Combining interoperability and flexibility. In: 12th International Conference on Knowledge Engineering and Knowledge Management (EKAW'2000): 2000; Juan-les-Pins, France; 2000.
16. Kozaki K, Kitamura Y, Ikeda M, Mizoguchi R: Hozo: An Environment for Building/Using Ontologies Based on a Fundamental Consideration of "Role" and "Relationship". In: Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management (EKAW 2002): October 1-4, 2002; Sigüenza, Spain; 2002: 213-218.
17. pOWL [http://powl.sourceforge.net]
18. Uschold M: Building Ontologies: Toward a Unified Methodology. In: 16th Annual Conference of the British Computer Society Specialist Group on Expert Systems: 1996; Cambridge, UK; 1996.
19. Cimiano P, Völker J: Text2Onto - A Framework for Ontology Learning and Data-driven Change Discovery. In: International Conference on Applications of Natural Language to Information Systems (NLDB): 2005; Alicante, Spain: Springer; 2005: 227-238.
20. Volz R, Oberle D, Staab S, Motik B: KAON SERVER - A Semantic Web Management System. In: Alternate Track Proceedings of the Twelfth International World Wide Web Conference (WWW2003): May 2003; Budapest, Hungary: ACM; 2003: 20-24.
21. Oberle D, Eberhart A, Staab S, Volz R: Developing and Managing Software Components in an Ontology-based Application Server. In: 5th International Middleware Conference: 2004; Toronto, Ontario, Canada: Springer; 2004: 459-478.
22. Frantzi K, Ananiadou S, Mima H: Automatic recognition of multi-word terms: the C-value/NC-value method. International Journal on Digital Libraries 2000, 3(2):115-130.
23. Card Sorting to Discover the Users' Model of the Information Space [http://www.useit.com/papers/sun/cardsort.html]
5.2 A PROPOSED SEMANTIC FRAMEWORK FOR REPORTING OMICS INVESTIGATIONS
Abstract. The current science landscape is rapidly evolving, and it is increasingly driven by computational tasks. The deluge of data unleashed by omics technologies such as transcriptomics, proteomics and metabolomics requires systematic approaches for reporting and storing the data and the experimental processes in a standard format, relating the biological information to the technology involved. Ontology-based knowledge representations have proved successful in providing the semantics for standardised annotation, integration and exchange of data. The framework proposed by the MGED RSBI working group would provide semantics for upper-level elements relevant to the representation and interpretation of omics-based investigations.
5.2.1 Introduction

When the first microarray experiments were published, it became apparent that the lack of robust quality-control procedures and capture of adequate biological metadata impeded the exchange and reporting of array-based transcriptomics experiments. The MIAME checklist (Brazma et al. [1]) was written in response to this lack by a group of biologists, computer scientists and data analysts, and aims to define the minimum information required to unambiguously interpret, and potentially reproduce and verify, a microarray experiment. This group then went on to make its composition official and founded the Microarray Gene Expression Data (MGED) Society. The response from the scientific community has been extremely positive, and currently most of the major scientific journals and funding agencies require publications describing microarray experiments to comply with the MIAME standard. The adoption of this standard by public and community databases, Laboratory Information Management Systems (LIMS) and several microarray informatics tools has greatly improved the interpretation of microarray experiments described in a structured manner. The MIAME model has been adopted by other communities (reviewed by Quackenbush [2]) and, as microarrays are incorporated into other complex biological investigations (including toxicogenomics, nutrigenomics and environmental genomics), it has
become apparent that analogous minimal descriptors should be identified for these applications. There have been several extensions to MIAME. MIAME/Tox is an array-based toxicogenomics standard developed by the EBI in collaboration with the ILSI Health and Environmental Sciences Institute (HESI), the National Institute of Environmental Health Sciences (NIEHS), the National Center for Toxicogenomics and the FDA National Center for Toxicological Research (NCTR). MIAME/Env has been developed by the Natural Environmental Research Council (NERC) Data Centre to fulfill the diverse needs of those working in the functional genomics of ecosystems, invertebrates and vertebrates not covered by the model-organism communities. MIAME/Tox and MIAME/Env have initiated several discussions in academic settings as well as in the industrial and regulatory arenas (OECD Toxicogenomics Guidelines [3]). However, it has become evident that as other omics technologies are used in combination with microarrays, these MIAME-based checklists will soon be insufficient to serve the scope of experimenters' needs. The toxicogenomics, nutrigenomics and environmental genomics communities soon recognised the need for a strategy that capitalises on synergy, forming the Reporting Structure for Biological Investigations (RSBI [4]) working group under the MGED [5] umbrella. The RSBI working group considers it very important to agree on a single source of basic conceptual information relating to the reporting of complex biological investigations employing omics technologies. This unified approach to describing the upper-level elements relevant to the representation and interpretation of these investigations should encompass any specific application. The possibility of enabling 'semantic integration' of complex data, facilitating data mining and information retrieval, is the rationale for developing an ontologically grounded conceptual framework. Ultimately, the effort by the RSBI working group aims to constitute the foundation of a standard reporting structure for publications and for submission to public repositories and knowledge bases. The need for information on which to base the evaluation and interpretation of results underlies the objective of presenting sufficient detail to readers and/or reviewers.
The information in complex biological investigations is highly nested, and formalising this knowledge to facilitate data representation is not a trivial task. To tackle this issue, the RSBI working group has established links with several standardisation efforts in the relevant biological domains (as reviewed by Sansone et al. [6]) and is working closely with the MGED Ontology working group, the HUPO [7] Proteomics Standards Initiative (PSI) and the Standard Metabolic Reporting Structure (SMRS [8]) group. These groups can clearly draw in large numbers of experimentalists and developers, and feed in the domain-specific knowledge of a wide range of biological and technical experts. This chapter is organised as follows. In Section 5.2.2 we briefly describe the methodology we followed for developing an ontologically grounded conceptual framework; in Section 5.2.3 we present the proposed upper-level ontology; Section 5.2.4 includes conclusions and future directions.

5.2.2 Methodology

Our scenario involves communities distributed geographically, and for the domain analysis and knowledge acquisition phases the group has used different independent technologies that were not always integrated into the Protégé suite (Noy et al. [9]). From these experiences, members of RSBI are also working with others on a collaborative knowledge-acquisition tool for the development of ontologies, integrated in Protégé (Garcia et al. [10]). Figure 5 schematises the methodology we followed. We built different models throughout our analyses of available knowledge sources and information gathered in previous steps. Firstly, a "baseline ontology" was gathered, i.e. a draft version containing few but seminal elements of an ontology. Typically, the most important concepts and relations were identified somewhat informally. We could assimilate this "baseline ontology" into a taxonomy, in the sense of a structure of categories and classifications. We consider a taxonomy to be "a controlled vocabulary which is arranged in a concept hierarchy", and an ontology to be "a taxonomy where the meaning of each concept is defined by specifying properties, relations to other concepts, and axioms narrowing down the interpretation". As
the process of domain analysis and knowledge acquisition evolves, the taxonomy takes the shape of an ontology. During this step, the ontologist worked primarily with only a very few of the domain experts; the others were involved in weekly meetings. In this phase the ontologist sought to provide the means by which the domain experts he or she was working with could express their knowledge. Some deficiencies in the available technology were identified, and for the most part were overcome by our use of concept maps (CMs).
Chapter 5 - Figure 5. Our methodology.
5.2.3 The RSBI Semantic Framework

Our approach is that of an upper ontology providing high-level semantics for the representation of omics-based investigations, serving as a conceptual scaffold onto which other ontologies may be hooked. An example of the latter could be an ontology specific to a technology, such as the MGED Ontology for microarrays, and/or specific to an application, such as toxicology. In order to describe the interaction of different technologies during the course of a scientific endeavour, we considered there was a need for a high-level container in which to place the information relevant to the biology as well as that relevant to the different assays. Our high-level concept is an Investigation, a self-contained unit of scientific enquiry containing information for Study(-ies) and Assay(s). We consider a Study to
be the set of steps and descriptions performed on the Subject(s). In cases where the Subject is a piece of tissue and no steps have been performed but just an Assay has been carried out, the Study contains only the descriptors of the Subject (e.g. provenance, treatments, storage, etc.). We consider an Assay to be the container for the test(s) performed and the data produced for computational purposes. There are different AssayType(s), and the different omics technologies fall within this category. A view of the RSBI upper ontology is shown in Figure 6, and the ontology is available from the RSBI webpage (http://www.mged.org/Workgroups/rsbi/rsbi.html).
Chapter 5 - Figure 6. A view of a section of the RSBI ontology. The corresponding OWL (Ontology Web Language) file representing the ontology is presented in Appendix 1.
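The containment just described can be paraphrased in code. The following minimal sketch is illustrative only; apart from Investigation, Study, Assay, Subject and AssayType, the names below are not RSBI terms:

```python
# Illustrative paraphrase of the RSBI upper-level containment: an
# Investigation holds Studies and Assays; a Study records steps performed
# on Subjects; an Assay holds the tests performed and the data produced.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Subject:
    descriptors: str                 # e.g. provenance, treatments, storage

@dataclass
class Study:
    subjects: List[Subject]
    steps: List[str] = field(default_factory=list)  # may be empty (tissue-only case)

@dataclass
class Assay:
    assay_type: str                  # the omics technologies are AssayTypes
    data: List[str] = field(default_factory=list)

@dataclass
class Investigation:                 # a self-contained unit of scientific enquiry
    studies: List[Study]
    assays: List[Assay]
```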
5.2.4 Conclusions and Future Directions

Since our framework will allow the use of different ontologies, the definition of whole/part relationships should be consistent across those different ontologies. However, there are currently no standards or guidance for defining whole/part-of relationships, adding another layer of complexity when developing an upper-level ontology. Upper-level, or top-level, ontologies describe very general concepts such as space, time and event, which are independent of a particular problem domain. Such unified top-level ontologies aim at serving large communities [11, 12]. For instance, the Standard Upper Ontology (SUO) [13] provides definitions for general-purpose terms, and it acts as a
foundation for more specific domain ontologies. General-purpose ontologies such as the RSBI provide a more specific semantic framework within which it is, in principle, possible to integrate other biological ontologies. As the RSBI aims to facilitate the annotation of biological investigations, the generality of its concepts is constrained to the three predefined domains of knowledge for which it was created (toxicogenomics, nutrigenomics and environmental genomics). The principles recommended by Niles and Pease [13] for developing upper-level ontologies were considered during the development of the RSBI ontology; however, as the RSBI ontology aims to facilitate the description of biological investigations, some practical considerations were also taken into account. Ultimately, the RSBI upper-level ontology should be able to answer a few questions and position almost anything approximately in the right place, even if no ontology yet exists for that spot. The relationship between Study and Assay defines an Investigation; different things participate in different processes and, by the same token, some things retain their form over time. Study and Assay contain information about those processes. It is particularly important to maintain minimal commitment when developing upper-level ontologies; only those concepts providing a common scaffold should be considered.

5.2.5 References
1. Brazma A, Hingamp P, Quackenbush J, et al. 2001. Minimum information about a microarray experiment (MIAME) - toward standards for microarray data. Nat Genet 29(4):365-371.
2. Quackenbush J. 2004. Data standards for 'omic' science. Nat Biotechnol 22:613-614.
3. OECD Toxicogenomics Guidelines: http://www.oecd.org/document/29/0,2340,en_2649_34377_34704669_1_1_1_1,00.html
4. MGED RSBI: http://www.mged.org/Workgroups/rsbi
5. MGED Ontology: http://mged.sourceforge.net/ontologies/index.php
6. Sansone SA, Morrison N, Rocca-Serra P, Fostel J. 2005. Standardization initiatives in the (eco)toxicogenomics domain: a review. Comp Funct Genomics 8:633-641.
7. HUPO PSI: http://psidev.sourceforge.net
8. SMRS: http://www.smrsgroup.org
9. Noy NF, Crubezy M, Fergerson RW, et al. 2003. Protégé-2000: an open-source ontology-development and knowledge-acquisition environment. AMIA Annu Symp Proc, 953.
10. Garcia Castro A, Sansone SA, Rocca-Serra P, Taylor C, Ragan MA. 2005. The use of conceptual maps for two ontology developments: nutrigenomics, and a management system for genealogies. Proceedings of the 8th International Protégé Conference. (Accepted for publication)
11. Sure Y. 2003. Methodology, Tools & Case Studies for Ontology based Knowledge Management. Karlsruhe: Universität Fridericiana zu Karlsruhe.
12. Gangemi A, Guarino N, Masolo C, Oltramari A, Schneider L. 2002. Sweetening ontologies with DOLCE. In: Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management, Ontologies and the Semantic Web. Springer-Verlag, 166-181.
13. Niles I, Pease A. 2001. Towards a standard upper ontology. In: Proceedings of the International Conference on Formal Ontology in Information Systems (FOIS-2001), 2-9. ACM Press, New York, NY, USA.
Information integration in molecular bioscience

As has previously been pointed out in this research, ontologies in the biosciences are to be used by software applications that ultimately seek to facilitate the integration of information in the molecular biosciences. Previous chapters of this research have explored how to develop ontologies within a highly decentralised environment such as the bio-community. However, as the involvement of the community does not end at the time the ontology is deployed, it is also important to consider those scenarios in which ontologies are used by software that supports some of the activities carried out by biologists. This chapter introduces the reader to some of the problems encountered when integrating information in the biosciences. Not only technical issues concerning the integration of heterogeneous data sources and the corresponding semantic implications, but also the integration of analytical results, are presented in this chapter. Within the broad range of strategies for integration of data and information, platforms and developments are here distinguished. The main contribution of this chapter is to present a view of the state of the art in data and information integration in molecular biology that is general and comprehensive, yet based on specific examples. The perspective this review gives to the reader is critical, and offers insights and categorisations not previously considered by other authors. This chapter concludes with the identification of some open issues for data and information integration in the molecular biosciences domain, and argues that with a wider application of ontologies and semantic web technologies some of these issues can be overcome.

This chapter contains an original critical assessment made entirely by the author, who conceived its structure, organisation and scope. The manuscripts that led to the published paper were written by Alex Garcia; the analysis and classification presented, as well as the critical insights, were worked out by Alex Garcia.
PUBLISHED PAPER ARISING FROM THIS CHAPTER

Garcia Castro A, Chen Y-PP, Ragan MA: Information integration in molecular bioscience: a review. Applied Bioinformatics 2005, 4(3):157-173.

AUTHORS' CONTRIBUTIONS

Alex Garcia Castro conceived the project and wrote the manuscripts for this paper. Yi-Ping Phoebe Chen provided useful discussion. Mark Ragan supervised the project, provided useful discussion and assisted Alex Garcia Castro in the preparation of the final manuscript.
6
Chapter VI - Information integration in molecular bioscience
Abstract. Integrating information in the molecular biosciences involves more than the cross-referencing of sequences or structures. Experimental protocols, results of computational analyses, annotations and links to relevant literature form integral parts of this information, and impart meaning to sequence or structure. In this review, we examine some existing approaches to integrating information in the molecular biosciences. We consider not only technical issues concerning the integration of heterogeneous data sources and the corresponding semantic implications, but also the integration of analytical results. Within the broad range of strategies for integration of data and information, we distinguish between platforms and developments. We discuss two current platforms and six current developments, and identify what we believe to be their strengths and limitations. We identify key unsolved problems in integrating information in the molecular biosciences, and discuss possible strategies for addressing them including semantic integration using ontologies, XML as a data model, and Graphical User Interfaces (GUIs) as integrative environments.
Molecular bioscience databases (MBDBs) are an essential resource for modern biological research. Researchers typically use these databases in a decentralised manner, mobilising data from multiple sources to address a given question. A researcher (user) thus builds up and integrates information from multiple MBDBs. From an early point in the development of online databases (the early 1990s), different approaches have been explored to bring about this integration [1]. At the same time, appreciation has grown that not only data, but information more broadly, must be integrated if the full potential of MBDBs is to be realised. Data integration per se can be the least difficult part of this undertaking; indeed, data integration sometimes means little more than achieving the "interactive hypertext flavor" of database interoperation [2]. In contrast, integrating information requires a conceptual model in which MBDBs are described in context [2] and links become meaningful relationships. The extraordinary degree and diversity of interconnectedness among biological information, and the lack of network models that capture these connections, have so far made it very difficult to achieve satisfactory integration of information in the molecular biosciences. MBDBs are essentially collections of entries that contain descriptors of sequences. Autonomous databases (often more than one) typically exist for each broad type of information (e.g. nucleotide sequence data). Most are poorly interconnected and differ in
scope, organisation [3, 4] and functionality. Both the databases themselves, and individual entries within them, may be incomplete. Sequences and their descriptors are meant to inform us about organisms and life processes. In this sense, a sequence is not merely an isolated entity but, on the contrary, is part of a highly interconnected network. Initially at least, MBDBs were intended merely as data repositories. Later, some databases were developed to facilitate the retrieval of connected information that goes beyond sequences – for example, metabolic pathway databases (MPDBs), in which sequences are nodes of networks (subgraphs) linked by edges that represent biochemical reactions. In such a context, the meaning of a sequence is given by the way it relates to other sequences, correlating data beyond sequences and reactions. Only by understanding this context can a user formulate an intelligent query. We believe that the integration of both data and information in molecular bioscience should embody more-holistic views: how do molecules, pathways and networks interact to build functional cells, tissues or organisms? How can health, development or diseases be modeled so all relevant information is accessible in meaningful context? Developing computational solutions that allow biologists to query multiple data sources in meaningful ways is a fundamental challenge in modern bioinformatics [5], one with important implications for the success of biomedicine, agriculture, environmental biotechnology and most other areas of bioscience. There have been many diverse approaches to integrating information in bioinformatics, and it is not feasible for us to review them all comprehensively. In this review, we focus primarily on those we consider to represent the evolution of this field toward more semantically driven integrative approaches. Much of this evolution has been driven by ad hoc responses to specific needs. We attempt to provide a conceptual framework in which to analyze these and to envision the potential of emerging semantically based technologies to contribute to this domain in the immediate future. This review is organised as follows. In the first section we present an overview of issues and technologies relevant to integration of information in the molecular biosciences, and
distinguish platforms from developments. Next, we focus on data integration (Section 6.2), describing some existing platforms and developments and considering the extent to which they can be considered to integrate information. In the third section (6.3) we address deeper issues of semantic integration of molecular-biological information, highlighting the role of ontologies. Section 6.4 presents XML not only as a format for data exchange, but also as a technology for introducing and managing semantic content. In Section 6.5 we describe how Graphical User Interfaces (GUIs) can provide integrative frameworks for data and analysis. In Section 6.6 we further detail metabolic pathway databases as a special case of integration – one in which data become more valuable in the context of other data, and in which a formal description of information relatedness and flow helps shape the description of biological processes. Section 6.7 summarises and concludes our analysis and presents what we consider to be key unsolved problems.
6.1 OVERVIEW OF ISSUES AND TECHNOLOGIES

It is a defensible philosophical position that integration – of observations with each other, with previous knowledge, or with a concept or hypothesis – is a necessary part of understanding. If so, most if not all scholarly or intellectual activities are integrative, as are many research technologies and methodologies. Although such a broad conceptualisation of integration is not by itself particularly powerful, it does help us appreciate that integration of information can be, and has been, approached in many ways from diverse perspectives. We have already distinguished data integration from integration of information, and in Sections 6.2.1 and 6.2.2 we will discuss both general and more-specific approaches that provide different (partial) solutions to data integration in the molecular biosciences. These interact to greater or lesser extents with generic issues of data management and sharing, some of which (e.g. version control, persistency, security) can be adequately addressed within commercial (or, to a lesser extent, open-source) database management systems, while other issues (including
data availability, control of data quality and standardisation of formats) also have domain-specific features.

6.1.1 Data availability

The scope and coordination of public databases such as those organised for the research community by the U.S. National Center for Biotechnology Information (NCBI), the European Bioinformatics Institute (EBI) and the DNA Data Bank of Japan (DDBJ) are a characteristic feature of molecular bioscience. However, extensive data are also held privately, some proprietarily within commercial enterprises (e.g. pharmaceutical or agrichemical companies) and others available on a subscription basis. Public data can be integrated into private databases, but (although the technical issues are presumably no different than for integration among public databases) because of access policies the reverse does not happen. Initiatives such as the International Crop Information System (http://www.icis.cgiar.org) and the Global Biodiversity Information Facility (http://www.gbif.org) will likewise come up against boundaries between public and private information.

6.1.2 Data quality

The major public sequence databases have instituted measures to ensure that the data they provide to the research community are of high quality. These include standard formats and software for data submission, automated processing of submissions, and the availability of human assistance. Nonetheless, as open, comprehensive repositories, these databases necessarily contain instances of incomplete or poor-quality data, missing fields and legacy formats that can only create problems for data and information integration. These problems should be largely absent from databases that are expert-curated (e.g. Mouse Genome Database [http://www.informatics.jax.org/], UniProt/Swiss-Prot [6]) or derived from a curated database (e.g. ProDom [http://protein.toulouse.inra.fr/prodom/current/html/home.php]), based around one or a few large projects (e.g. Ensembl [7], FlyBase [http://flybase.bio.indiana.edu/]), or otherwise narrowly focused (e.g. Protein Kinase Resource [http://pkr.sdsc.edu/html/index.shtml], Snake Neurotoxin Database
[http://research.i2r.a-star.edu.sg/Templar/DB?snake_neurotoxin/]). On the other hand, a proliferation of boutique databases could create physical, networking and other obstacles to integration.

6.1.3 Standardisation

International efforts through the 1990s, in part through the Committee on Data for Science and Technology (CODATA; http://www.codata.org) of the International Council for Science, led to highly coherent data formats for molecular sequence data at EBI, NCBI and DDBJ. This is despite the evolution, during that decade, of sequencing technology from manual slab gels with autoradiographic detection, to automated slab gels with fluorescent detection, to today's capillary-based technologies. However, data standardisation remains a major issue in fields where it may be less obvious what experimental conditions are relevant to interpretation, and where alternative technologies may be intrinsically less compatible. The MIAME/MGED (minimum information about a microarray experiment/Microarray Gene Expression Data Society; http://www.mged.org) and MAGE [8] (microarray and gene expression) initiatives among the expression microarray community, and the Proteomics Standards Initiative (PSI; http://psidev.sourceforge.net), exemplify the efforts being undertaken to establish data standards for newer types of molecular data. Technological issues cut across the integration of information in diverse ways, many of which are discussed, in greater or lesser detail, in the sections that follow. Two others bear further mention here: language and access.

6.1.4 Language

Integrative frameworks or developments can be built using either general-purpose or specialised languages. Thus, for example, the annotation pipeline PRECIS [9] is based on Awk and Perl, while MAGPIE [10] uses Prolog, C, Perl and (for the GUI) Java™/JavaScript. By contrast, the data source-wrapping functions of the Sequence Retrieval System (SRS) [11, 12] are implemented in a unique language, ICARUS, while Kleisli [13, 14] makes use of a high-level query language called sSQL (Semi-Structured Query Language). Open Bioinformatics
Foundation projects such as Bioperl (http://www.bioperl.org), BioJava (http://www.biojava.org) and the like provide modules, scripts and sometimes ready-to-use code for many different tasks in bioinformatics. Open Bioinformatics Foundation projects are a community effort to provide repositories of useful tools, both for end users (stand-alone applications) and for developers (Perl modules that can later be used in different in-house applications).

6.1.5 Access

"Grid" initiatives refer to a vision of the future in which data, resources and services will be seamlessly accessible through the Internet, in the same sense that the electricity delivered to our homes and offices is generated and transmitted via diverse power plants, transmission lines, substations and the like that electric-power users rarely have to think about. "The grid" will actually be multiple grids (data grid, computation grid, services grid) and will be useful only to the extent that relevant components communicate with each other. Many existing grid initiatives are coordinated through the Global Grid Forum (http://www.ggf.org), in which the bioinformatics community is actively represented. It is envisioned that the computational grid will be implemented using a standard "toolkit" of reference software (http://www.globus.org). Other initiatives focus on how data can be most efficiently shared across a data grid. The Life Science Identifier (http://lsid.sourceforge.net), for example, has been proposed as a uniform resource name (URN) for any biologically relevant resource. It is being offered as a formal standard that would be served on top of, not as a replacement for, existing formats or schemata. A third interrelated set of perspectives on information integration relates to the interface with humans. This is necessarily a broad area, and many issues relate more to physiology, psychology or sociology than to bioinformatics per se. The most typical interface with individual humans (users) is the GUI, discussed further in Section 6.5.
6.2 STRATEGIES FOR DATA INTEGRATION

6.2.1 Platforms

We recognise two broad strategies for integration of data and information in molecular biology: (i) provision of a general platform (framework, backbone) and (ii) addressing a specific problem via a specific development. A platform for data integration offers a technological framework within which it is possible to develop point solutions; usually platforms provide non-proprietary languages, data models and data exchange/exporting systems, and are highly customisable. By contrast, a development may not provide technology applicable to other problems, even in the same domain. A platform is meant to be a deeper layer over which several heterogeneous solutions may share a common background; in this way some degree of interoperability can be achieved. Kleisli and DiscoveryLink® [15] are examples of platforms over which heterogeneous data can be integrated. Platforms provide a data model and query optimisation procedures, and offer a general query language as well as flexible data exchange mechanisms. We consider developments to be proposed solutions that are either built on top of platforms for data integration or make extensive use of wrappers or parsers. They do not provide a significant environment, and their integrative context is limited. Davidson et al. [16] define three different classes of integrative strategies: (i) link-driven federations, (ii) view integration and (iii) warehousing. Proposed solutions in which users must navigate through predefined links (provided by the system) among different data sources in order to extract information are called link-driven federation solutions. SRS and GeneCards® [17] are examples of this category. In the view integration approach, the schemata of a collection of underlying data sources are merged to form a global schema under some common model. Users query this global schema using a high-level query language such as CPL (Combined Programming Language), SQL (Structured Query Language) or OQL (Object Query Language). The system determines what portion of the query can be answered by which data source, ships local
queries to the appropriate data source, and then combines the answers from the various data sources to produce an answer to the global query. Kleisli is one example of this type of integrative strategy. The warehousing approach is distinguished by holding information in a common location (the warehouse) instead of in multiple locations. Data warehousing requires a unified data model that can accommodate the information in the component data sources; it also needs a series of programs that fetch the information and transform it into the unified data model [18]. Data warehouses are not easy to implement and administer, and keeping the information up to date is further complicated if changes in the original data sources require structural changes in the warehouse. Kleisli and DiscoveryLink® can be considered platforms for data integration. Although not fully integrated strategies, they address data integration from a broader perspective than does an individual development.

Kleisli: Kleisli is a mediator system, encompassing a nested relational data model, a high-level query language and a query optimiser. It provides a high-level query language, sSQL, which can be used to express complicated transformations across multiple data sources. The sSQL module can be replaced with other high-level query languages. It is possible to add new data types if an appropriate wrapper is available or can be added. Kleisli does not have its own Database Management System (DBMS); instead, it has functionality to convert many types of database systems into its nested relational data model. Kleisli does not require schemata; its relational data model and its data exchange format can be translated by external databases [14]. Kleisli is thus a backbone that is not limited in application to the biological domain.
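Kleisli's nested relational model is easiest to see by example. The sketch below imitates such a structure in Python; the field names and values are hypothetical, and this is not Kleisli's own syntax (sSQL queries look quite different):

```python
# Illustrative only: a nested relational view of a sequence entry, of the
# kind Kleisli's data model and sSQL can query and transform directly.
entry = {
    "accession": "X00001",
    "organism": "Oryza sativa",
    "features": [                      # a relation nested inside the record
        {"type": "CDS", "start": 10, "end": 250},
        {"type": "promoter", "start": 1, "end": 9},
    ],
}

# An sSQL-style transformation ("select the spans of CDS features"),
# expressed here as a comprehension over the nested relation:
cds_spans = [(f["start"], f["end"])
             for f in entry["features"] if f["type"] == "CDS"]
print(cds_spans)  # [(10, 250)]
```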
DiscoveryLink®: DiscoveryLink® is also a mediator system. This product (and its emerging successor, WebSphere® Information Integrator [http://www-306.ibm.com/software/data/integration/partners.html]) addresses the problem of integrating information from a much broader perspective, via a technological platform that enables the user to query a broad range of file types that are part of a federation. DiscoveryLink® exhibits
.d o
o
.c
m
C lic
m o
.d o
w
w
w
w
w
C lic
k
to
bu
y
N O
W
!
PD
!
PD
c u-tr a c k
.c
H F-XC A N GE
H F-XC A N GE
c u-tr a c k
W N O y bu to k w
several advantages over other proposed solutions, basically because it relies on a de facto data model for most commercial DBMSs. For example, users are able to carry out post-query manipulations, and can use explicit SQL statements to define their queries; query optimisation is also a feature of DiscoveryLink®. However, adding new data sources or analysis tools into the system is not a straightforward process. DiscoveryLink® wrappers are written in C++, which is not necessarily the most suitable programming language for wrappers [14]. Extensive knowledge of SQL and of relational database technology is needed. DiscoveryLink® is built over IBM's DB2® technology, which is a commercial product. Although new data sources can often readily be incorporated within the DiscoveryLink® federation, it may be much more difficult to integrate DiscoveryLink® per se with non-DiscoveryLink® environments. Kleisli and DiscoveryLink® support queries across a heterogeneous federation using a standard language. Both systems allow users to manipulate BLAST® results via SQL statements, integrate flat files into the federation via a wrapper, and query federated flat files using SQL, either in a preformatted graphical way or via a command line. It is possible to have Microsoft® Word and Excel® files, and also text files, as part of the federation. XML files and the PubMed database can also be accessed. Users typically need extensive knowledge of the schemata that represent the information and a high-level understanding of the system. Neither system is powerful in the hands of a naive end user. Wrappers, such as those used in both Kleisli and DiscoveryLink®, mediate between the query system and a specific data source (or type of data source). Systems that wrap multiple heterogeneous data sources thus translate their data into a common integrated data representation [19]. Wrappers thus provide a kind of lingua franca through which two different databases communicate and produce a result. Retrieval components in wrappers map queries to common gateway interface (CGI) calls. Changes in data sources make it difficult to maintain wrappers.
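The role of a wrapper can be made concrete with a small sketch. Both source formats and the common representation below are invented for illustration; real wrappers are considerably more involved:

```python
# Sketch of the wrapper idea: each wrapper translates one source's native
# records into a common integrated representation that the query system
# understands. Source formats and field names here are invented.
def wrap_flatfile(line: str) -> dict:
    # a pipe-delimited flat-file source, e.g. "P12345|kinase|Homo sapiens"
    acc, desc, org = line.strip().split("|")
    return {"id": acc, "description": desc, "organism": org}

def wrap_structured(record: dict) -> dict:
    # a source that already parses to a dict, but with different field names
    return {"id": record["accession"],
            "description": record["annotation"],
            "organism": record["source_organism"]}

# The query layer sees one schema regardless of origin:
records = [wrap_flatfile("P12345|kinase|Homo sapiens"),
           wrap_structured({"accession": "Q67890",
                            "annotation": "phosphatase",
                            "source_organism": "Mus musculus"})]
hits = [r for r in records if r["organism"] == "Homo sapiens"]
print(hits)
```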
6.2.2 Developments

There have been many developments, some of which offer a framework over which the integration of analytical tools is possible, whereas others were designed to provide GUI capabilities only for very specific algorithms. Commercial packages such as VectorNTI® (http://www.invitrogen.com), Lasergene® (http://www.dnastar.com) and SeqWeb® (http://www.accelrys.com/products/gcg/seqweb.html) basically offer additional functionality (e.g. access to databases of plasmid sequences, tools for immediate visualisation of 3-dimensional structure, etc.) and facilities for direct manipulation of specialised hardware devices. In this section we present some of the existing developments currently available. A summary is given in Table 1. We do not pretend to cover all of the existing developments; rather, we present some representative developments. A more detailed list of online tools is available in the supplementary material at http://130.102.113.135/sup_material.html.

6.2.2.1 Sequence Retrieval System (SRS)
SRS is an information indexing and retrieval system designed for libraries with a flat-file format such as the EMBL [20] Nucleotide Sequence Database, the Swiss-Prot [6] protein sequence databank or the PROSITE [21] library of protein subsequence consensus patterns. SRS was intended to be a retrieval tool that allows the user to access as many different biological data sources as possible via a common GUI. It is relatively easy to integrate new data into SRS. SRS wraps data sources via a specialised, built-in wrapping programming language called ICARUS. It can be argued, however, that parsers should be written in a general purpose language, rather than a language being built around a parser [22].
Feature | SRS | GeneCards® | Entrez | Ensembl | BioMOBY
Integration of new data sources | Allowed | Possible, but requires recoding | Yes, by central administration | Possible | Via the registry service
Query language | Some query-language capacities via ICARUS | NA | API | SQL/Perl | API
Query optimisation | NA | NA | NA | NA | NA
Data model | No true data model | NA | NA | Oriented to solving a specific problem | NA
Data exchange | Low | Low | Low | Medium | Low
Technology involved | Perl, JavaScript, ICARUS | Perl/Glimpse | NA | MySQL®/Perl | XML/SOAP/Perl/Java™
Graphical User Interface (GUI) | Intuitive, functional | Intuitive, functional | Intuitive, functional, navigable, friendly | Intuitive, functional, navigable | Early stages of development
Level of link-driven solution | High | High | High | Medium | Medium
Analysis tools available | Available | NA | Available | Available | Available
Licensing | Free for academic use, local installation | Free for academic use | Access over the Web | Free for academic use, local installation | Free, download from the MOBY website

API = application program interface; NA = not available; SOAP = Simple Object Access Protocol; SQL = Structured Query Language; SRS = Sequence Retrieval System.

Chapter 6 - Table 1. Some existing developments in database integration in molecular biology.
SRS provides integration capacities by allowing the user to navigate through information contained in different heterogeneous data sources, and to capture entries. As such, it is a good example of a link-driven federation. With SRS, the user still has to know the schema of each database and formulate suitable, valid queries. Further operations or transformations are not easy in SRS. To some extent, SRS may be seen as a GUI-integration approach; not only deeper integration but even the capacity for interoperation is limited.
As a relatively popular attempt at a unified GUI to heterogeneous data sources, SRS may provide important information about Human-Computer Interaction (HCI) in the bioinformatics field. Neither Kleisli nor DiscoveryLink® is deeply comparable with SRS. This is because SRS focuses more on the integration of molecular data, whereas DiscoveryLink® and Kleisli were, from their beginnings, products designed to allow the biomedical community to access data in a wide variety of file formats.

6.2.2.2 GeneCards®

GeneCards® is a compendium of information on individual human genes. It has been built using Perl, with Glimpse (GLobal IMPlicit SEarch; http://webglimpse.net) as an indexing system. GeneCards® may be seen as a classical collection of human gene-related data, implemented in a single information space and with some query capacities. It does not provide a clear integration perspective, and should not be seen as a solution beyond its initial purposes. Adding new data is not a straightforward process.

6.2.2.3 Entrez

Entrez [23, 24] is an integral part of the NCBI portal system, and as such is an integrative solution within the NCBI problem framework. It provides a single portal for access to most existing genomes, along with some analysis tools and database-querying capacities for genomic, protein and bibliographic information about specific genes. Graphical displays for chromosomes, contig maps and integrated genetic and physical maps are also available. Entrez also links each data element to neighbors of the same type [23].

6.2.2.4 Ensembl

Ensembl is not by itself a data-integration effort, but rather an automatic annotation tool. The task of annotation involves integrating information from different data sources, at different levels and using different methods in concert. Ensembl provides general visualisation tools and the ability to work with different data sources. Ensembl relies on open-source
Raw data are loaded into the MySQL®-based (http://www.mysql.com) internal schema of Ensembl and processed through its annotation pipeline; results can be visualised using, for example, the Apollo [25] genome browser. Query capacities in Ensembl are limited, but more flexible capacities may be achieved by the addition of Perl or Python scripts. Ensembl does not inherently provide alternative data models (or the flexibility to supply alternative models), data exchange formats or substantial data exchange capacities. Thus, Ensembl is not a data integration platform, but rather a solution that addresses a specific problem (genome browsing and automatic annotation).
6.2.2.5 BioMOBY
BioMOBY (http://www.biomoby.org) proposes an integrative solution based on a web client/service/registry architecture in which data repositories are treated as web services. The BioMOBY project was established to address the problem of discovering and retrieving related instances of biological data from multiple hosts and services via a standardised query and retrieval interface that uses consensus object models. BioMOBY allows users to relate information by using a priori knowledge of what has previously been considered relevant information. The idea behind MOBY is simple: a registry acts as a server in which knowledge of the different services is stored; different clients make use of this central facility, thereby allowing the formulation of queries, and a client interacts with different data sources regardless of the underlying schema. BioMOBY provides a common format for the representation of retrieved data regardless of their origin, and in this way eliminates the need for endless cutting and pasting [26]. Relying on the paradigm of Universal Description, Discovery and Integration (UDDI), BioMOBY integrates data by integrating the different services in which these data are stored. At this point in its evolution, BioMOBY addresses a specific type of problem, and does not provide a real framework over which other solutions can be built.
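The registry pattern at the heart of this architecture can be sketched schematically. The following is not the BioMOBY API; the service names, data types and lookup logic are invented for illustration, as a minimal model of a central registry that maps biological data types to the services able to consume them.

```python
from collections import defaultdict

class Registry:
    """Toy central registry: data types -> services that consume them."""

    def __init__(self):
        self._by_input = defaultdict(list)

    def register(self, service_name, consumes, produces):
        # A service advertises the data type it accepts and the one it returns.
        self._by_input[consumes].append((service_name, produces))

    def discover(self, data_type):
        # A client holding data of this type asks what it can do next.
        return list(self._by_input[data_type])

registry = Registry()
registry.register("getSequence", consumes="GeneID", produces="DNASequence")
registry.register("blastSearch", consumes="DNASequence", produces="BlastReport")
print(registry.discover("GeneID"))  # [('getSequence', 'DNASequence')]
```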
6.2.2.6 myGrid
myGrid [27] is a prototype grid environment for bioinformatics, aiming to integrate web facilities in a grid of bioinformatics services. It addresses the problem of integration by identifying the main sources of information: literature sources, analysis methods and databanks. myGrid provides high-level services for data and application integration, such as resource discovery, workflow enactment and distributed query processing, and combines analysis and query processes into concrete, executable workflows. It provides an integrated ontology in which data and analysis sources are described. myGrid offers a high-level middleware solution to support personalised in silico experiments over a service grid; the project anticipates a grid in which data sources and analysis tools are orchestrated into reusable components. myGrid recognises three different categories of services: (i) services for forming experiments, (ii) services for discovery and metadata management, and (iii) services for supporting e-science. While it is true that workflows in bioinformatics interleave query and analysis processes, these are not a single unified process, and administering such descriptions over the service grid has proven to be difficult.
6.2.2.7 Others
Other projects also provide different functional capabilities and integrate information from heterogeneous sources for a particular purpose or for a specific community. FlyBase and WormBase (http://www.wormbase.org/) are examples of such integrative efforts that aim to provide ‘all’ the available information related to a particular organism.
6.3 SEMANTIC INTEGRATION OF INFORMATION IN MOLECULAR BIOSCIENCE
Syntactic integration basically deals with heterogeneity of form: the structure, but not the meaning. Semantic integration, on the other hand, fundamentally deals with the meaning of information and how it relates within a specific field.
It addresses the problem of identifying semantically related objects in different databases, then resolving the schematic (schema-related) differences among them [28]. A simple scenario in which semantic implications matter is one in which a protein is identified in a particular databank by a certain accession number, but has a different identifier, or even a different annotation, in another database (i.e. may appear non-synonymous). In the case of bioinformatics, semantic integration could (and, we argue, should) be seen to encompass not only the taxonomy of terms (controlled vocabulary) or the resolution of semantic disagreements (e.g. between synonyms in different databases), but also the discovery of services (databases and/or analysis algorithms). Semantic integration of MBDBs thus focuses, at some level, on how a database entry can be related to other information sources in a meaningful way.

Our previous descriptions of database integration (in Section 6.2) addressed the problem of querying, and extracting data from, multiple heterogeneous data sources. If this can be done via a single query, the data sources involved are considered interoperable. We have not so far considered how a particular biological entity might be meaningfully related to others; only location and accessibility have been at issue. In the same way, the complexity of a query would be largely a function of how many different databases must be queried and, from these, how many internal subqueries must be formed and exchanged for the desired information to be extracted. If a deeper layer that embeds semantic awareness were added, it is probable that query capacities would be improved. Not only could different data sources be queried, but (more importantly) interoperability would then arise naturally as a consequence of semantic awareness. At the same time, it should be possible to automatically identify and map the various entries that constitute the knowledge relationship, empowering the user to visualise a more descriptive landscape.

A richer example of why semantics matters may be seen with the word 'gene', a term that has different meanings in different databases. In GenBank® [29], a gene is a "region of biological interest with a name that carries a genetic trait or phenotype" and includes non-structural coding DNA regions such as introns, promoters and enhancers.
In the Genome Database [30], a gene is just a "DNA fragment that can be transcribed and translated into a protein". The RIKEN Mouse Full-length cDNA Encyclopedia [31], which focuses on full-length transcripts, refers to the transcriptional unit instead of the gene. Queries involving databases that, among themselves, present such semantic issues necessarily have limited operability. Ontology can provide a guiding framework within which the user can restrict a query to the context in which it makes sense, and can navigate intelligently across terms. Semantic approaches thus depend heavily on ontology.

What is ontology? Notions of what ontology is, and how it should be implemented, differ but include the following: (i) a system of categories accounting for a particular vision of the world [32]; (ii) specification of conceptualisations [33]; (iii) a concise and unambiguous description of principal relevant entities, with their potential, valid relations to each other [34]; and (iv) a means of capturing knowledge about a domain, such that it can be used by both humans and computers [35].

In the molecular biosciences, an ontology should capture, in axioms, the relations among concepts. These axioms might then be used to extract implicit knowledge such as the transitive closure of relations (if an enzyme is a type of protein, and a protein a type of polypeptide, then an enzyme is a type of polypeptide) [36]. Ontology may also provide a framework for describing living systems in terms of information. For example, metabolic pathways describe many different chains of reactions that relate different biological entities. These complex networks reflect a deep layer of concepts that describe the system and, if represented appropriately, could support visualisation, querying and the implementation of further analyses.

Thus, we see that an ontology is not simply a controlled vocabulary, nor merely a dictionary of terms. Controlled vocabularies per se describe neither relations among entities nor relations among concepts, and consequently cannot support inference processes. Database schemata describe categories, and provide an organisational model of a system, but do not necessarily represent relations among entities. Database schemata can be derived from ontologies, but the reverse step is not so straightforward. An ontology might better be considered as a type of knowledge base in which concepts and relations are stored, and which makes inference capacities available.
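The transitive-closure example above (enzyme, protein, polypeptide) is mechanically simple, and a few lines of code suffice to show how such implicit knowledge can be derived from explicitly asserted is-a facts. The sketch below is purely illustrative: only the two assertions from the text are stored, and the third relation is inferred. The traversal also handles graphs in which a concept has several parents.

```python
def ancestors(concept, is_a):
    """All concepts reachable from `concept` via is-a links (handles DAGs,
    i.e. multiple parents, without revisiting nodes)."""
    seen, stack = set(), [concept]
    while stack:
        for parent in is_a.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Only two facts are asserted; the third is derived by closure.
is_a = {"enzyme": ["protein"], "protein": ["polypeptide"]}
assert "polypeptide" in ancestors("enzyme", is_a)
```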
Ontologies, and metadata based on ontologies, are sometimes presented as means to support the sharing and reuse of knowledge [37]. Application of an ontologically based approach should be more powerful than simple keyword-based methods for information retrieval. Not only can semantic queries be formed, but axioms that specify relations among concepts can also be provided, making it possible for a user to derive information that has been specified only implicitly [38]. In this way, relevant entries and text can be found even if none of the query words is present (e.g. a query for "furry quadrupeds" might retrieve pages about bears).

Data integration in MBDBs raises problems of syntactic and semantic heterogeneity, and semantic integration is a difficult task. In the special case where a top-level concept matches another concept in a different ontology, merging these two branches would require all derived concepts to be checked; as yet, this cannot be automated. An ontology to be applied across multiple databases might best be placed centrally, so that each database can be mapped to it directly, not via other databases. General inference algorithms might then identify identical or similar concepts in other databases [39].

Can the use of ontology improve query capacities? We believe it can, but much optimisation of query logic will be required if the full benefits are to be won; an ontological veneer over existing databases will achieve little. With the current state of MBDBs, ontology might be helpful primarily as a flexible guidance system, supporting the user in building queries by relating concepts. To avoid the philosophically difficult question of what constitutes a related concept, we prefer to think not in terms of related concepts in general, but rather of relations restricted to a defined context.

Molecular biology has an emerging de facto standard ontology. The Gene Ontology™ [40] (GO) consortium, established in 1998, provides a structured, precisely defined, common controlled vocabulary for describing the roles of genes and gene products in eukaryotic cells. GO embodies three different views of these roles: as functions, as processes and as cellular components.
Although developed as a taxonomy of concepts and their attributes for eukaryotic cells, the GO framework is, in principle, extensible to other biological systems. GO is organised as a directed acyclic graph: a child can have many parents. Although GO is not the only ontology relevant to the molecular biosciences, it has been used to annotate genes across multiple organisms. Insofar as it embodies the definition of a standard for the annotation of gene products, GO integrates genomic data and protein annotation from many sources [41]. Ideally we should be able to use high-quality annotation to explore a variety of hypotheses about evolutionary relations and other comparative aspects of genomics.

Although the number of ontologies formally used in bioinformatics applications is still small, where they have been used they span a wide range of purposes, subject areas and representation styles. So far, ontologies have been used in three distinct areas: database schema definition (e.g. EcoCyc [42]), annotation and communication (e.g. GO), and query formulation (e.g. TAMBIS [43]).

TAMBIS is a special case of a point development. TAMBIS (Figure 1) is modelled as a knowledge base, and provides single, unified access to multiple data sources. It is a high-level, ontology-centric query facilitator in which queries are formulated against a canonical representation of biomedical concepts, and in which maps of terms relating this representation are linked with external data sources [44, 45]. TAMBIS therefore provides a level of interaction between the user and the external sources that removes the need for the user to be aware of the schema. It is based on a three-layer mediator/wrapper architecture [46] and uses Kleisli as a backend database. TAMBIS is intended to improve query capacities in MBDBs via its supplied conceptual model, a knowledge-driven user interface and a flexible representation of biological knowledge that supports inference processes over the relations among concepts. The representation is implemented using the GRAIL description logic (http://www.openclinical.org/dld_galenGrail.html). With TAMBIS, the user is guided over an ontologically informed map of concepts related to a given query. This is done by exposing the user to the terminological model, and by providing a guided query formulation system implemented in a graphical tool.
Queries are written into a form-based interface positioned over a source-independent ontology of bioinformatics concepts [46]. The user is therefore not navigating through links, but moving in a conceptual space. TAMBIS builds queries and Kleisli executes them.
Chapter 6 - Figure 1. Schematic representation of the architecture of TAMBIS. CPL = Collection Programming Language
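As a purely illustrative aside (this is not how TAMBIS is implemented), the idea of guiding a user through a conceptual space reduces to a small lookup: given the concept currently anchoring the query, enumerate only the relations and target concepts that are valid from it. All names in the sketch below are invented.

```python
# Invented miniature "concept space"; relations and targets are
# illustrative assumptions, not TAMBIS's actual terminological model.
relations = {
    ("protein", "hasFunction"): ["enzyme activity"],
    ("protein", "encodedBy"): ["gene"],
    ("gene", "locatedOn"): ["chromosome"],
}

def next_steps(concept):
    """Relations (and target concepts) a user may add to a query
    anchored at `concept`: the 'guided' part of query formulation."""
    return {rel: targets
            for (src, rel), targets in relations.items() if src == concept}

print(next_steps("protein"))
# {'hasFunction': ['enzyme activity'], 'encodedBy': ['gene']}
```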
Genotype-phenotype relations are modelled in PharmGKB, the Pharmacogenetics and Pharmacogenomics Knowledge Base (http://pharmgkb.org) [47], an Internet-based resource that integrates complex biological, pharmacological and clinical data. PharmGKB contains genomic, phenotypic and clinical information collected from ongoing pharmacogenetic studies. It is organised as a knowledge base, with an ontology that contains a network of genetic, clinical and cellular phenotype knowledge interconnected by relations and organised by levels of abstraction [48].
The relevance of semantics in bioinformatics extends beyond issues of the representation of information stored in databases. As biological research moves increasingly in silico, and becomes less easily distinguishable from computational and information science, semantically based web technologies are likely to contribute to the unification of grid and workflow environments. myGrid is intended to provide an environment in which defined workflows are reusable and shared by means of semantic layers. Semantic description helps make the knowledge behind a workflow explicit and, thus, shareable.
6.4 XML AS A DESCRIPTION OF DATA AND INFORMATION
XML is a language for describing data. It is becoming a standard document format in areas well beyond the biosciences [49]. By itself, XML provides neither a description of the data so formatted nor an integrative solution. But because XML is becoming a de facto data exchange format in the molecular biosciences (i.e. the standard data format for exchange purposes), we include a discussion of XML and the way it is being used.

Data migration between programming languages is a problem in bioinformatics. XML offers a way to serve and describe data in a uniform and automatically parsable format. Powerful XML-based query languages for semi-structured data have been developed, e.g. XPath (XML Path Language; http://www.w3.org/TR/xpath20), XQuery (http://www.w3.org/TR/xquery) and XQL (XML Query Language; http://www.w3.org/TandS/QL/QL98/pp/flab.txt and http://www.ibiblio.org/xql/). As the relational data model does not by itself adequately represent information in the molecular biosciences, it may be useful to build around XML a framework for integrating tools in molecular biology, for querying, and for transforming results for analysis by appropriate agents [50]. A robust, stable XML data integration and warehousing system does not yet exist. However, once high-performance data stores become available [49], perhaps in a grid environment, XML and XML-based tools may mature into an alternative data integration platform comparable with Kleisli and DiscoveryLink®.
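To make this querying style concrete: Python's standard library, for instance, implements a subset of XPath, and the short sketch below runs a path query with a predicate over a small XML fragment. The record layout is invented for illustration and follows no particular bioinformatics DTD.

```python
import xml.etree.ElementTree as ET

# Illustrative record layout only; ElementTree supports a subset of XPath.
doc = ET.fromstring("""
<entries>
  <entry id="P12345"><organism>E. coli</organism></entry>
  <entry id="P67890"><organism>H. sapiens</organism></entry>
</entries>
""")

# Select entries whose <organism> child has the given text content.
for entry in doc.findall(".//entry[organism='E. coli']"):
    print(entry.get("id"))  # P12345
```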
Today, researchers in the biosciences rely mostly on string matching and link analysis for database searching; computer functionality (operations relevant to different data types) in MBDBs is very limited. However, if computers were able to understand the semantic implications not only of data but also of queries, more functionality and accuracy might be added to database searching operations. XML provides a general framework for such tasks. Although not a descriptive language, XML provides a representational framework within which semantic values can be introduced into the description of content. XML promotes interoperability between agents that agree on some standards; conversely, disparity in vocabulary undermines the potential use of XML in database integration. Application of ontology, however, could provide a complete and conceptually adequate way of describing the semantics of XML documents. By deriving document type definitions (DTDs) from an ontology, document structure can be grounded on a true semantic basis, making XML documents an adequate input for semantically based processing [38]. In this way, conceptual terms can be used to retrieve facts.

An example of a software development making extensive use of XML is LabBook (http://www.labbook.com), a genomic XML viewer and browser based on BSML™ (Bioinformatics Sequence Markup Language; http://www.bsml.org), an open XML data representation and interchange format. Both the viewer and the browser provide an intuitive graphical environment for visualising genomic information.

XML has been used extensively in bioinformatics as an exchange format, but complete XML-based integrative solutions have not yet been developed. We believe that XML should be understood as a powerful data model, since it allows the flexible definition of sets of tags as well as the hierarchical nesting of tags. BioXML came about as an effort to develop standard biological XML schemata and DTDs; it was intended to become a central repository, part of the Open Bioinformatics Foundation (http://www.open-bio.org). However, the BioXML project appears to be inactive.
BioMOBY inherited some of the desired features of BioXML; within MOBY, lightweight XML documents comprise a set of descriptor objects that are passed from MOBY Central to the clients.
6.5 GRAPHICAL USER INTERFACES (GUIs) AS INTEGRATIVE ENVIRONMENTS
Querying databases is only one part of the research process in the molecular biosciences. Once relevant data have been retrieved, analysis must be undertaken. Very often the process is not clear in advance, and the user must iteratively query, retrieve, analyse and compare until the desired endpoint is attained. These steps are most easily carried out in an integrated environment within which the functionality of MBDBs is brought together with appropriate analysis tools, allowing the user to specify and carry out computational experiments, record intermediate and final data, and annotate experiments.

Analysis tools in molecular bioscience are likewise heterogeneous, and may typically include remote web tools, locally installed executables, and scripts in, for example, Perl, Python and/or SQL. Especially between, but even within, application fields, interoperability tends to be limited or nonexistent. Instead, the output of one program must usually be reformatted for input into the next; this piping is mostly done using purpose-written Perl parsers, of which there is no central library or listing.

The GCG® [51] and EMBOSS [52] suites are two well-known sequence analysis software packages that group many methods commonly used in molecular biology. Both are command-line driven, which requires users to have at least a basic familiarity with UNIX® command-line syntax. In recent years, therefore, some groups have developed GUI systems [53, 54] that suppress the syntactic complexity of UNIX® commands, thereby promoting the coordinated use of the programs in these packages. A list of some of the existing GUIs for EMBOSS [52] and GCG® is given in Table 2.
GUI           Program the GUI is for          Scripting language   Functionality / remarks
Jemboss       EMBOSS                          Java™                Provides GUI; has some data management utilities.
W2H           EMBOSS / GCG®                   Perl                 Provides data management tools. Highly functional; new resources can be added.
Pise          EMBOSS                          XML/Perl             Highly functional, highly customisable. Easy to add new methods.
W3H           NA                              Perl                 Process automation capability; limited API.
Perl2EMBOSS   EMBOSS                          Perl                 Provides GUI. No data management utilities.
SeWeR         General GUI for web resources   HTML, JavaScript     Allows access to different analytical tools; not GCG®/EMBOSS focused. Limited integration with Pise is possible.

API = application program interface; NA = not available.
Chapter 6 - Table 2. Some of the most commonly used Graphical User Interfaces (GUIs) for EMBOSS and GCG®
Pise [54] is a particularly important example. Its applicability is based around Perl modules generated from the XML description of a targeted program. Each module contains all the information needed to generate the corresponding HTML form and CGI script. Pise has some scripting and macro capabilities, and supports the specification of pipelines. An example of a Pise macro is: ClustalW - DNADIST - NEIGHBOR. The output of one program (ClustalW) becomes the input of another (DNADIST, from the PHYLIP package), and the output from DNADIST is in turn used as the input file for the NEIGHBOR program. The graph of all possible paths may be seen at the Pise website at http://wwwalt.pasteur.fr/~letondal/Pise/gensoft-map.html.
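The dataflow behind such a macro is easy to express programmatically. The sketch below chains three command-line invocations with subprocess; the ClustalW flags are plausible, but the PHYLIP steps are expressed as hypothetical wrapper commands, since the real invocations (and the format conversions Pise inserts between steps) are more involved.

```python
import subprocess

# Each step consumes the previous step's output file. Command names and
# argument conventions are simplifying assumptions, not what Pise
# actually generates.
steps = [
    ["clustalw", "-infile=seqs.fasta", "-outfile=seqs.aln"],
    ["dnadist_wrapper", "seqs.aln", "seqs.dist"],    # hypothetical wrapper
    ["neighbor_wrapper", "seqs.dist", "seqs.tree"],  # hypothetical wrapper
]

for cmd in steps:
    # Abort the whole pipeline if any step fails, as a workflow engine would.
    subprocess.run(cmd, check=True)
```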
Pise provides two different ways by which a macro can be customised: either a form (supplied within Pise) can be filled out with parameters supplied by the user, or the user can save an entire procedure as a macro. The user is presented with a 'Register the whole procedure' button that builds the scripts and allows users to repeat their actions [54]. G-PIPE (http://if-web1.imb.uq.edu.au/Pise/5.a/gpipe.html) is a development built on top of Pise. It provides an automatic workflow generator using Pise as a GUI framework. The workflow descriptions are stored as XML documents that can later be loaded and run across different G-PIPE servers. This automation is possible thanks to a set of Perl modules that check the syntactic consistency of the different files in order to evaluate them as possible input files for the different steps in a given workflow.

W2H [53] is one of the oldest and most powerful GUIs in bioinformatics. In a sense, it has evolved from a GUI into an environment, as it provides not only GUI capacities but also some functionality for file handling. W2H was developed making extensive use of the metadata files that describe the applications available in the former GCG® package. In W2H, these files are used to generate on-the-fly HTML documents with forms for entering values of command-line parameters [53]. W2H embodies a classical tool-oriented approach; combinations of tools were not initially supported [55]. W2H now provides some problem-oriented tools (a task framework), allowing users to define data workflows. For this, W2H again makes use of metadata, as well as descriptions of workflow and dataflow. Workflow in this context refers to the sequence of tasks (methods, programs) that are part of a user's analysis chain; dataflow is basically the parsing of one output into the subsequent task. In the existing W2H, the dataflow description is used by the web interface to dynamically create HTML input forms for the task data. With the given metadata, the web interface can collect input from the user, determine whether all minimum requirements are fulfilled, and provide the data to the task system. The name given to this task framework is W3H [55]. W3H reduces the amount of programming skill required; however, the definition of tasks is not an automatic process, and shareability of tasks among other GUI environments is not possible under W3H.
Shareability, in contrast, was considered from the beginning in the design of Pise/G-PIPE, where the whole task or workflow can be exported as an XML file (which may later be loaded into the same system as Pise/G-PIPE) or as a Perl script that can be customised by the user. Both W3H and Pise make extensive use of BioPerl. W3H is immersed within the HUSAR (Heidelberg Unix Sequence Analysis Resources) environment, and has pre-built parsers that enable connectivity with different datasets available on the local HUSAR installation of SRS, GeneCards® and other databases/facilities. PATH (Phylogenetic Analysis Task in HUSAR) was developed within the framework provided by W3H [56]. Dependencies among applications, descriptions of program flow, and the merging of individual outputs (or parts thereof) into a common output report are provided to the system. cDNA2Genome [57] is another task developed under the W3H framework; it allows high-throughput mapping and characterisation of cDNAs. cDNA2Genome can be divided into three main categories: database homology searches, gene finders and sequence feature predictors (e.g. start/stop codons, open reading frames). cDNA2Genome is available at http://genius.embnet.dkfz-heidelberg.de/menu/biounit/open-husar/.

Perl2EMBOSS (http://bioinfo.pbi.nrc.ca/~lukem/EMBOSS/) is an example of a simple Perl-based GUI to EMBOSS. Unlike W2H and Pise, Perl2EMBOSS cannot be used as it stands for GCG® or any other package, and does not provide task definition capacities or an application program interface (API). Perl2EMBOSS is easy to install and provides forms (Perl scripts) for the building of input files. As its source code is well documented and well structured, this GUI is easy to administer and simple for the user. It may be considered a 'light' solution.

SeWeR [58] (SEquence analysis using WEb Resources) is a GUI for a different scenario from EMBOSS or GCG®. It was designed to make extensive use of JavaScript and dynamic HTML (DHTML), and thus provides a very lightweight solution. It presents a uniform interface to most common services in bioinformatics, including polymerase chain reaction-related analyses, sequence alignment, database searching, protein structure prediction and sequence assembly. It provides several levels of customisation of the interface, and is highly amenable to batch processing and automation.
Jemboss [59] is yet another GUI for use with EMBOSS. In this case, a web launch tool (Java Web Start) must be installed on the client's computer. The user is presented with an intuitive window that gives access to his or her assigned area on the server on which EMBOSS is running. Jemboss uses SOAP (Simple Object Access Protocol; http://www.w3.org/TR/soap/), reducing security risks by allowing the user to access the EMBOSS application as a client. The display area gives the user complete control over the environment; analyses are run on the defined EMBOSS server. Via a job manager, it is possible to run and monitor batch processes [59].

All of the GUIs described above provide graphical access to a specific set of analysis tools; they do not provide integration with retrieval systems such as GeneCards® or SRS. The coded functionality available in W3H in this respect is very limited: sequences are identified, and users can get intermediate access to databank entries. However, this limited integration is not enough, as even simple operations such as the automatic presentation of analysis options over a set of previously identified sequences are not available. Such integration (context menus embedded within the GUI) would define an environment within which query capacities and analytical tools coexist in a single, unified working area. The selection of one of these GUIs over another depends entirely on the problem at hand. All provide essentially the same features, and source code is available for each. The definition of analysis pipelines remains limited among these solutions.
6.6 METABOLIC PATHWAY DATABASES AS AN EXAMPLE OF INTEGRATION
A pathway can be defined as a linked set of biochemical reactions, such that the product of one reaction is a reactant of, or an enzyme that catalyses, a subsequent reaction [60]. An MPDB is a bioinformatics development that describes biochemical pathways and their reactions, components, associated experimental conditions and related relationships. To the extent that it is sufficiently comprehensive, an MPDB can be seen as a description of an organism at the metabolic level.
In the same way, other aspects of an organism can be described in gene regulation databases, protein-protein interaction databases, signal transduction databases and so forth.

The techniques used for building metabolic pathway databases range from manual analysis to automated computational methods. The resulting databases differ in the types of information they contain and in the software tools they make available for queries, visualisation and analysis [42]. Quality is assured by combinations of manual and automatic curation processes. This is the case for BRENDA® [61] (Braunschweig Enzyme Database); this manually curated database contains information on all molecules that have been assigned an Enzyme Commission (EC) number. By querying the database it is possible to retrieve information about an enzyme for all the organisms in which it is present. BRENDA® is rich in literature references; these are parsed for relevant key phrases directly from PubMed, and are then associated with the corresponding enzymes.

Another example of an MPDB is the KEGG [62] (Kyoto Encyclopedia of Genes and Genomes) pathway database, which aims to link genomic information with higher-order functional information by computerising current knowledge of cellular processes and by standardising gene annotations [63]. Within KEGG, genomic information is stored in the GENES database (a collection of gene catalogues), while higher-order functional information is stored in the PATHWAY database.

The WIT [64] (What Is There) database is another example of an MPDB. WIT has been designed to extract functional content from genome sequences and organise it into a coherent system. It supports comparative analysis of sequenced genomes, and generates metabolic reconstructions based on chromosomal sequences and on metabolic modules from the Enzymes and Metabolic Pathways (EMP)/Metabolic Pathways (MPW) family of databases. WIT provides a set of tools for the characterisation of gene structures and functions. After genes have been assigned initial functions, they are 'attached' to pathways by choosing templates from the metabolic database (MPW) that best incorporate all observed functions. When this basic model has been created, a (human) curator evaluates the model against biochemical data and phenotypes known from the literature. Textual and graphical representations are fully linked with the underlying data.
The Pathway Tools software [65, 66] constitutes an environment for creating a metabolic pathway database for a given organism or genome. Pathway Tools has three components: PathoLogic, which facilitates the creation of new pathway/genome databases from GenBank® entries; Pathway/Genome Navigator, for query, visualisation and analysis; and Pathway/Genome Editor, which provides interactive editing capabilities. Computationally derived pathway/genome databases available today include AgroCyc (Agrobacterium tumefaciens; http://biocyc.org/AGRO/organism-summary?object=AGRO), MpneuCyc (Mycoplasma pneumoniae; http://biocyc.org/MPNEU/organism-summary?object=MPNEU) and HumanCyc (Homo sapiens; http://humancyc.org/); a more detailed list can be found at http://www.biocyc.org.

Pathway data are becoming more abundant as a consequence of genomic sequencing, the spread of high-throughput analytical methods and the growth of systems biology. There is therefore a need to organise these data in a rational and formalised way, i.e. to model our knowledge of metabolic data. The first step necessarily relates to the storage and recovery of information. The complexity of this type of data, and in particular the fact that some information is held in the relationships between biological entities rather than in the entities themselves, complicates their selection and recovery. Further complication is added by the need to model higher-level interactions; the model needs to be clearly delimited in advance. Moreover, our knowledge is often incomplete: elements may be missing, or pathways may be different or totally unknown in a newly sequenced organism (or indeed in unknown organisms, e.g. from environmental genomics projects or surveys). Constructing databases that can provide inference mechanisms to assist the discovery process, based on incomplete information of this sort, remains a significant challenge.
Chapter 6 - Figure 2. Valine biosynthetic pathway in Escherichia coli, illustrating the relationships among biological data types (reproduced from the EcoCyc website [68], with permission). The position of the transcription start site of ilvC has more recently been modified to reflect an updated version of the E. coli genome sequence, U00096.2
Ideally, MPDBs should integrate information about the genome and the metabolic networks of a particular organism. The metabolic network can be described in terms of four bio-object types: the pathways that compose the network, the reactions that compose the pathways, metabolic compounds, and the enzymes that catalyse the reactions [42]. Literature citations are typically provided for most of the information units; however, this information is very often incomplete, and extraction of information from GenBank® and PubMed to assist systematic annotation of gene functions is not a trivial process. Figure 2 exemplifies the relationships among these four biological data types.
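A minimal object model of these four bio-object types makes the relationships concrete. The sketch below is an illustrative assumption: the field names and structure follow no particular MPDB schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Compound:
    name: str

@dataclass
class Enzyme:
    name: str
    ec_number: str  # Enzyme Commission identifier

@dataclass
class Reaction:
    substrates: List[Compound]
    products: List[Compound]
    catalysed_by: List[Enzyme]
    citations: List[str] = field(default_factory=list)  # literature links

@dataclass
class Pathway:
    name: str
    reactions: List[Reaction] = field(default_factory=list)
```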
6.7 SUMMARY, CONCLUSIONS AND UNSOLVED PROBLEMS
In this review we have argued that the integration of information in molecular bioscience (and, by extension, in other technical fields) is a deeper issue than access to a particular type of data (sequences, structures) or record (a GenBank® accession number). It requires technical (IT-level) integration among heterogeneous, probably federated, data sources, for which platforms such as Kleisli and DiscoveryLink® have been designed. For specific problems, point solutions such as SRS and GeneCards® provide sufficient connectivity to deliver a product to the end user. Graphical interfaces and related developments such as W2H and Pise support the use of specific tools, sometimes in combinations or pipelines. TAMBIS adds a conceptual model, a flexible logic set and an ontological frame of reference. XML is currently a formatting standard, but has the potential to be used in deeper ways. Metabolic pathway databases exemplify the role of higher-order relationships as biologically relevant information.

Ontologies have been used as conceptual models for query formulation in TAMBIS; myGrid makes more extensive use of existing ontology, not only as a unification factor but also as a means of establishing automatic discovery of bioinformatics services. Semantic and syntactic issues are equally important, but defining the boundaries between them is not a simple task.
Projects such as myGrid and BioMOBY require well-defined ontologies in order to support the automatic discovery of services in bioinformatics. Capturing the semantics of biological data types is difficult, in part because they are highly contextual, but also because of the lack of expressivity of knowledge representation languages such as the OWL Web Ontology Language (http://www.w3.org/TR/owl-features). There is no solution to the old common question "What can I do with this piece of information?", since, for a given data type, there may be more than a hundred services capable of dealing with it. Which one to use, and how to define workflows over it, depends on the context. By providing accurate descriptions, and by having more detailed knowledge of the business logic behind research operations, semi-autonomous agents with some intelligence may increasingly assist users in defining programmatic structures for their in silico experiments. Ontologies in bioinformatics have focused on descriptions of 'things', but very few of them actually describe research processes; moreover, no ontologies describe the relationship between the process for studying a 'thing' and the thing itself. Ontologies of this kind may facilitate the implementation of the vision of the semantic web in bioinformatics.

Information in the molecular biosciences is fragmented and dispersed, yet highly interconnected, semantically rich and very contextual. Queries on these data can be highly conceptual, and the context is user-dependent and often highly subjective. All of these factors make data representation a substantial challenge. It is not obvious that concepts of, for example, retrieval efficiency developed in domains such as commerce or finance can be usefully applied to data in molecular bioscience. We are heading towards a semantic web in bioinformatics, but the role of ontologies and agent technologies in practical implementations has not yet been properly defined. The vision of the semantic web in bioinformatics remains fragmentary; current technology is far from providing real semantic capabilities even in domains such as word processing. For two words such as 'purpose' and 'propose', Microsoft® Word (for example) advises on syntactic issues, but gives no guidance about the context of the words. Semantic issues in bioinformatics workflows are more complex still, and it is not clear whether existing technology can effectively overcome these problems.
A relevant discussion within the bioinformatics community has addressed the disjunction between semi-structured and fully structured data. Because of the nature of scientific data, and the processes the community undertakes to produce them, systems relying on semi-structured frameworks seem to be the better option. XML emerges as a clear candidate because it provides mechanisms for self-description of the information it represents. By providing semantically valid tags, XML documents make it easy to identify functional operations appropriate to each data type. Certain analytical methods may be automatically identified and presented to users directly from the XML file, and the identification of processes relevant to the different data types may be combined with automatic discovery of services. Ontologically grounded XML schemata, and complete XML-based solutions, are not yet available within this community.

A sequence or structure with no descriptors can only be an isolated, and probably meaningless, unit of information. It is through annotation that we capture the way in which the role of a biological entity is to be understood in a given context. Complex queries go beyond simply mounting SQL queries over heterogeneous data sources; they involve context-dependent concepts and relations. Annotation plays a key role in representing and modelling this information, precisely because it is the way in which a role is understood in a given context. Representing how genes and proteins relate to the cell cycle, cell death, diseases, health status or, more generally, to any type of biological process requires not only functional annotation per se, but also the integration of relevant literature annotations. At this point, querying MBDBs and mining or categorising literature are distinct operations; no software tools are openly available that combine the two steps (e.g. by allowing users to retrieve relevant literature for a given sequence or structure query beyond that which happens to be cited within the entries retrieved).

Finally, analytical tools have not yet been fully integrated with indexing, data and workflow management systems. As discussed in Section 6.5, GUIs are available for a wide variety of implementations of diverse analytical methods, but we are far from having access to a unified, platform-independent analytical environment. One bioinformatics company, LION bioscience, has taken some steps in this direction with its SRS version 6.0.
We continue to believe, however, that real information integration in molecular bioscience requires a unified analytical and data-handling environment for users. A relevant analogy is the diversity of operations available for a particular data type within the Windows® operating system environment: all of the possible operations are presented to the user via a contextual menu displayed when the user right-clicks on the icon of interest. In the same way, operations over biological data types should be identified in advance, presented to the user and then executed; currently, all these operations are done either by coding them or by copy/paste procedures. Simplification of coding operations should be enabled within GUI frameworks (e.g. direct-manipulation interfaces). We think that concepts from projects such as Haystack [67] should be more carefully considered in bioinformatics.

Automation of data handling and knowledge extraction, along with tools that support the interpretation of extracted knowledge, are likewise not yet available to bioinformaticians. Such a set of tools should support the selection and planning of 'wet' experiments in the laboratory. Computer models are increasingly used to complement laboratory experiments, and tools that extract and integrate knowledge would be powerful adjuncts to these models at all stages of their implementation and use. Biological knowledge is spread not only over many databases but also (and in a more complicated way) across thousands of papers, patents and technical reports. It is in these latter documents that facts are described in the context in which the underlying biological entities have been studied; real integration, therefore, should support conceptual queries over fully integrated views of relevant data sources.

Ideally, future biological information systems (BISs) will require neither frequent (and difficult) data and software updates nor local data integration (warehouses); they should allow semantically based data integration through ontologies (improving data integration) and should support monitoring of the evolution of information sources. Future BISs should also allow each researcher to ask questions within the context of his or her own problem domain (and between domains), unconstrained by local or external data repositories. They should proactively inform the user about new, relevant information, based on individual needs, and support collaboration by matching researchers who have relevant expertise and/or interests.
Achieving this level of integration – as data in the molecular biosciences continue inexorably to increase and diversify – will continue to pose challenges on many levels.
6.8 ACKNOWLEDGMENTS
We thank Dr. Limsoon Wong and the reviewers for extremely helpful suggestions. Financial support for ARC Discovery Project DP0342987 and the ARC Centre in Bioinformatics CE0348221 is acknowledged. The authors have no conflicts of interest that are directly relevant to the content of this review.
6.9 REFERENCES
1. Sirotkin, K., NCBI: Integrated Data for Molecular Biology Research. 1999, Norwell, MA: Kluwer Academic Publishers.
2. Karp, P.D. and S. Paley, Integrated access to metabolic and genomic data. Journal of Computational Biology, 1996. 3(1): p. 191-212.
3. Keen, G., et al., The Genome Sequence DataBase (GSDB): Meeting the challenge of genomic sequencing. Nucleic Acids Research, 1996. 24(1): p. 13-16.
4. Benson, D.A., et al., GenBank. Nucleic Acids Research, 1997. 25(1): p. 1-6.
5. Brooksbank, C., et al., The European Bioinformatics Institute's data resources. Nucleic Acids Research, 2003. 31(1): p. 43-50.
6. Bairoch, A. and R. Apweiler, The SWISS-PROT protein sequence database: Its relevance to human molecular medical research. Journal of Molecular Medicine, 1997. 75(5): p. 312-316.
7. Hubbard, T., et al., The Ensembl genome database project. Nucleic Acids Research, 2002. 30(1): p. 38-41.
8. Spellman, P., et al., Design and implementation of microarray gene expression markup language (MAGE-ML). Genome Biology, 2002. 3(9): p. 0046.1-0046.9.
9. Lord, P., et al. PRECIS: An automated pipeline for producing concise reports about proteins. in IEEE International Symposium on Bio-informatics and Biomedical Engineering. 2001. Washington: IEEE Press.
10. Gaasterland, T. and C.W. Sensen, MAGPIE: Automated genome interpretation. Trends in Genetics, 1996. 12(2): p. 76-78.
11. Etzold, T. and P. Argos, Transforming a Set of Biological Flat File Libraries to a Fast Access Network. Computer Applications in the Biosciences, 1993. 9(1): p. 59-64.
12. Zdobnov, E.M., et al., The EBI SRS server - recent developments. Bioinformatics, 2002. 18(2): p. 368-373.
13. Davidson, S., et al., BioKleisli: A digital library for biomedical researchers. International Journal of Digital Libraries, 1997. 1: p. 36-53.
14. Wong, L., Kleisli, a Functional Query System. Journal of Functional Programming, 2000. 10(1): p. 19-56.
15. Haas, L., et al., DiscoveryLink: A system for integrated access to life sciences data sources. IBM Systems Journal, 2001. 40: p. 489-511.
16. Davidson, S.B., et al., K2/Kleisli and GUS: Experiments in integrated access to genomic data sources. IBM Systems Journal, 2001. 40(2): p. 512-531.
17. Rebhan, M., et al., GeneCards: a novel functional genomics compendium with automated data mining and query reformulation support. Bioinformatics, 1998. 14(8): p. 656-664.
18. Stein, L.D., Integrating biological databases. Nature Reviews Genetics, 2003. 4(5): p. 337-345.
19. Lacroix, Z., Biological Data Integration: Wrapping Data and Tools. IEEE Transactions on Information Technology in Biomedicine, 2002. 6(2): p. 123-128.
20. Stoesser, G., et al., The EMBL Nucleotide Sequence Database: major new developments. Nucleic Acids Research, 2003. 31(1): p. 17-22.
21. Sigrist, C., et al., PROSITE: A documented database using patterns and profiles as motif descriptors. Briefings in Bioinformatics, 2002. 3: p. 265-274.
22. Chenna, R., SIR: a simple indexing and retrieval system for biological flat file databases. Bioinformatics, 2001. 17(8): p. 756-758.
23. Macauley, J., H.J. Wang, and N. Goodman, A model system for studying the integration of molecular biology databases. Bioinformatics, 1998. 14(7): p. 575-582.
24. Tatusova, T.A., I. Karsch-Mizrachi, and J.A. Ostell, Complete genomes in WWW Entrez: data representation and analysis. Bioinformatics, 1999. 15(7-8): p. 536-543.
25. Lewis, S.E., et al., Apollo: a sequence annotation editor. Genome Biology, 2002. 3(12): p. 1-14.
26. Wilkinson, M. and M. Links, BioMOBY: An Open Source Biological Web Services Proposal. Briefings in Bioinformatics, 2002. 3: p. 331-341.
27. Stevens, R., J. Robinson, and C. Goble, myGrid: personalised bioinformatics on the information grid. Bioinformatics, 2003. 19: p. 302-304.
28. Kashyap, V. and A. Sheth, Semantic similarities between objects in multiple databases. 1999, San Francisco: Morgan Kaufmann.
29. Benson, D.A., et al., GenBank. Nucleic Acids Research, 1999. 27(1): p. 12-17.
30. Attwood, T.K. and C.J. Miller, Which craft is best in bioinformatics? Computers & Chemistry, 2001. 25(4): p. 329-339.
31. Okazaki, Y., et al., Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature, 2002. 420(6915): p. 563-573.
32. Guarino, N. Some Ontological Principles for Designing Upper Level Lexical Resources. in First International Conference on Language Resources and Evaluation. 1998. Granada, Spain.
33. Gruber, T.R., A Translation Approach to Portable Ontology Specifications. Knowledge Acquisition, 1993. 5(2): p. 199-220.
34. Schulze-Kremer, S. OntoCell: An Ontology of Cellular Biology. in Third Pacific Symposium on Biocomputing. 1998: AAAI Press.
35. Stevens, R., Bio-ontology page. 2005, Manchester, UK: http://www.cs.man.ac.uk/~stevensr/ontology.html.
36. Blaschke, C., L. Hirschman, and A. Valencia, Information extraction in molecular biology. Briefings in Bioinformatics, 2002. 3: p. 154-165.
37. Friedman, N.N. and C.D. Hafner, The State of the Art in Ontology Design: A Survey and Comparative Review. AI Magazine, 1997. 18: p. 53-74.
38. Erdmann, M. and R. Studer. Ontologies as Conceptual Models for XML Documents. in 12th Workshop on Knowledge Acquisition, Modeling and Management (KAW-99). 1999. Banff, Canada.
39. Köhler, J. and S. Schulze-Kremer, The Semantic Metadatabase (SEMEDA): Ontology Based Integration of Federated Molecular Biological Data Sources. In Silico Biology, 2002. 2: p. 0021.
40. Ashburner, M., et al., Gene Ontology: tool for the unification of biology. Nature Genetics, 2000. 25(1): p. 25-29.
41. Yeh, I., et al., Knowledge acquisition, consistency checking and concurrency control for Gene Ontology (GO). Bioinformatics, 2003. 19(2): p. 241-248.
42. Karp, P.D., EcoCyc: The Resource and the Lessons Learned. Bioinformatics Databases and Systems, 1999: p. 47-62.
43. Stevens, R., et al., TAMBIS: Transparent access to multiple bioinformatics information sources. Bioinformatics, 2000. 16(2): p. 184-185.
44. Baker, P.G., et al. TAMBIS: Transparent Access to Multiple Bioinformatics Information Sources. in Sixth Conference on Intelligent Systems for Molecular Biology (ISMB-98). 1998. Montreal, Canada: ISMB.
45. Wiederhold, G., Integration of Knowledge and Data Representation. IEEE Computer, 1992. 21: p. 38-50.
46. Paton, N.W., et al. Query Processing in the TAMBIS Bioinformatics Source Integration System. in 11th Int. Conf. on Scientific and Statistical Database Management (SSDBM). 1999: IEEE Press.
47. Klein, T.E., J.T. Chang, M.K. Cho, et al., Integrating genotype and phenotype information: an overview of the PharmGKB project. Pharmacogenomics Journal, 2001. 1: p. 167-170.
48. Rubin, D.L., S. Farhad, D.E. Oliver, et al., Representing genetic sequence data for pharmacogenomics: an evolutionary approach using ontological and relational models. Bioinformatics, 2002. 18: p. 207-215.
49. Wong, L., Technologies for Integrating Biological Data. Briefings in Bioinformatics, 2002. 3(4): p. 389-404.
50. Bry, F. and P. Kröger, A Computational Biology Database Digest: Data, Data Analysis, and Data Management. International Journal of Distributed and Parallel Databases, 2003. 13: p. 7-42.
51. Devereux, J., P. Haeberli, and O. Smithies, A comprehensive set of sequence analysis programs for the VAX. Nucleic Acids Research, 1984. 12(1): p. 387-395.
52. Rice, P., I. Longden, and A. Bleasby, EMBOSS: The European Molecular Biology Open Software Suite. Trends in Genetics, 2000. 16(6): p. 276-277.
53. Senger, M., et al., W2H: WWW interface to the GCG sequence analysis package. Bioinformatics, 1998. 14(5): p. 452-457.
54. Letondal, C., A Web interface generator for molecular biology programs in Unix. Bioinformatics, 2001. 17(1): p. 73-82.
55. Ernst, P., K.H. Glatting, and S. Suhai, A task framework for the web interface W2H. Bioinformatics, 2003. 19(2): p. 278-282.
56. Del Val, C., et al., PATH: a task for the inference of phylogenies. Bioinformatics, 2002. 18(4): p. 646-647.
57. Del Val, C., K.H. Glatting, and S. Suhai, cDNA2Genome: A tool for mapping and annotating cDNAs. BMC Bioinformatics, 2003. 4: p. 39.
58. Malay, K.B., SeWeR: a customizable and integrated dynamic HTML interface to bioinformatics services. Bioinformatics, 2001. 17: p. 577-578.
59. Carver, T.J. and L.J. Mullan, Website update: A new graphical user interface to EMBOSS. Comparative and Functional Genomics, 2002. 3(1): p. 75-78.
60. Karp, P.D., Pathway databases: A case study in computational symbolic theories. Science, 2001. 293(5537): p. 2040-2044.
61. Schomburg, I., A. Chang, and D. Schomburg, BRENDA, enzyme data and metabolic information. Nucleic Acids Research, 2002. 30(1): p. 47-49.
62. Ogata, H., et al., KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research, 1999. 27(1): p. 29-34.
63. Kanehisa, M. and S. Goto, KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research, 2000. 28(1): p. 27-30.
64. Overbeek, R., et al., WIT: integrated system for high-throughput genome sequence analysis and metabolic reconstruction. Nucleic Acids Research, 2000. 28(1): p. 123-125.
65. Karp, P.D., et al., The MetaCyc database. Nucleic Acids Research, 2002. 30(1): p. 59-61.
66. Karp, P.D., S. Paley, and P. Romero, The Pathway Tools Software. Bioinformatics, 2002. 18: p. 225-232.
67. Quan, D., D. Huynh, and D.R. Karger. Haystack: A Platform for Authoring End User Semantic Web Applications. in 2nd International Semantic Web Conference. 2003. Sanibel Island, Florida: Springer-Verlag, Heidelberg.
68. EcoCyc. E. coli K-12 pathway: valine biosynthesis [online]. Available from URL: http://biocyc.org/ECOLI/new-image?type=PATHWAY&object=VALSYN-PWY [Accessed 2005 Sep 30].
Workflows in bioinformatics: meta-analysis and prototype implementation of a workflow generator

In the previous chapter it was argued that a wider application of ontologies and Semantic Web technologies in bioinformatics would make it possible to overcome some of the issues that arise when integrating information. Workflows are identified as a fundamental component of information integration in the molecular biosciences, as researchers need to interleave information access and algorithm execution in a problem-specific workflow. Within this problem-specific workflow there are syntactic issues as well as semantic ones. Allowing the concrete execution of the workflow is a syntactic problem; describing this in silico experiment is, however, a semantic one, for which an ontology with characteristics similar to those presented in Chapter 5, Section 5.2, is required. Having well-defined syntax and semantics not only eases some technical aspects, but also allows for better reusability of the workflow in a larger context – a community of users.

This chapter addresses the problem of workflows in bioinformatics, more specifically supporting workflows for the Pasteur Institute Software Environment (PISE). Both syntactic and semantic aspects are investigated. From this meta-analysis, syntactic structures and algebraic operators common to many workflows in bioinformatics were identified. The workflow components and algebraic operators can be assimilated into re-usable software components. Semantic issues were also investigated; the MGED-RSBI ontology was adapted for this specific set of biological investigations, and other semantic aspects of developing workflow systems were explored. G-PIPE, a prototype implementation of this framework, provides a GUI builder to facilitate the generation of workflows and the integration of heterogeneous analytical tools.

Alex Garcia was responsible for the conceptualisation, initial investigation and finalisation of the research described in this chapter. Alex Garcia conceived the workflow
generator, graphical user interface, and semantic structures. He also participated in the development of the tool, and wrote the corresponding papers.
AUTHORS' CONTRIBUTIONS
Alex Garcia Castro was responsible for design and conceptualisation, took part in implementation, and wrote a first draft of the manuscript. Samuel Thoraval was the main developer of G-PIPE. Leyla Jael Garcia Castro assisted with server issues and FCA. Mark A. Ragan supervised the project and participated in writing the manuscript.
PUBLISHED PAPERS ARISING FROM THIS CHAPTER
Garcia Castro A, Thoraval S, Garcia LJ, Chen Y-PP, Ragan MA: Bioinformatics workflows: G-PIPE as an implementation. In: Network Tools and Applications in Biology (NETTAB), 5-7 October 2005, Naples, Italy, pages 61-64.
Garcia Castro A, Thoraval S, Garcia LJ, Ragan MA: Workflows in bioinformatics: meta-analysis and prototype implementation of a workflow generator. BMC Bioinformatics 2005, 6:87.
Chapter VII - Workflows in bioinformatics: meta-analysis and prototype implementation of a workflow generator
Abstract. Computational methods for problem solving need to interleave information access and algorithm execution in a problem-specific workflow. The structures of these workflows are defined by a scaffold of syntactic, semantic and algebraic objects capable of representing them. Despite the proliferation of GUIs (Graphical User Interfaces) in bioinformatics, only some of them provide workflow capabilities; surprisingly, no meta-analysis of workflow operators and components in bioinformatics has been reported. We present a set of syntactic components and algebraic operators capable of representing analytical workflows in bioinformatics. Iteration, recursion, the use of conditional statements, and the management of suspend/resume tasks have traditionally been implemented on an ad hoc basis and hard-coded; by having these operators properly defined it is possible to use and parameterise them as generic re-usable components. To illustrate how these operations can be orchestrated, we present G-PIPE, a prototype graphic pipeline generator for PISE that allows the definition of a pipeline, parameterisation of its component methods, and storage of metadata in XML formats. This implementation goes beyond the macro capacities currently in PISE. As the entire analysis protocol is defined in XML, a complete bioinformatics experiment (linked sets of methods, parameters and results) can be reproduced or shared among users. Availability: http://ifweb1.imb.uq.edu.au/Pise/5.a/gpipe.html (interactive), ftp://ftp.pasteur.fr/pub/GenSoft/unix/misc/Pise/ (download). From our meta-analysis we have identified syntactic structures and algebraic operators common to many workflows in bioinformatics. The workflow components and algebraic operators can be assimilated into re-usable software components. G-PIPE, a prototype implementation of this framework, provides a GUI builder to facilitate the generation of workflows and the integration of heterogeneous analytical tools.
7.1 BACKGROUND
Computational approaches to problem solving need to interleave information access and algorithm execution in a problem-specific workflow. In complex domains like the molecular biosciences, workflows usually involve iterative steps of querying, analysis and optimisation. Bioinformatics experiments are often workflows: they link analytical methods that typically accept an input file, compute a result, and present an output file. Most tool-driven integration approaches have so far addressed the problem of providing a single GUI for a set of analytical methods; combining methods into a flexible framework is usually not considered. Analytical workflows provide a path to discover information beyond the capacities of simple query statements, but are much harder to implement within a common environment.
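As a concrete illustration of such chaining, the following minimal Perl sketch (an illustration written for this discussion, not G-PIPE source; it assumes the EMBOSS programs seqret and transeq are installed, and the file names are invented) shows one method's output file becoming the next method's input file:

use strict;
use warnings;

# Step 1: convert an EMBL-format file to FASTA (its output feeds step 2).
system('seqret', '-sequence', 'genes.embl', '-outseq', 'genes.fasta') == 0
    or die "seqret failed\n";

# Step 2: translate the nucleotide sequences produced by step 1.
system('transeq', '-sequence', 'genes.fasta', '-outseq', 'proteins.fasta') == 0
    or die "transeq failed\n";

Hard-coding such chains is exactly what a workflow generator is meant to replace.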
Workflow management systems (WFMS) are basically systems that control the sequence of activities in a given process [1]. In molecular bioscience, these activities can be divided between those that address query formulation and those that focus more on analysis. At this abstract level, WFMS could serve to control the execution of both query and analytical procedures. All of these procedures involve the execution of activities, some manual, some automatic. Dependency relationships among them can be complex, making the synchronisation of their execution a difficult problem.

One dimension of the complexity of workflows in the molecular biosciences is given by the various transformations performed on the data. Syntactic (operational) interoperability establishes the possibility for data to be piped from one method into another. Semantic issues (another dimension) arise from the fact that we need to separate domain knowledge from operational knowledge. We should be able to describe the task of configuring a workflow from its primary components according to a required specification, and implement a program that realises this configuration independently of the workflow and components themselves.

Biologists provide rich descriptions of their experiments (materials and methods) so they can be easily replicated. Once techniques have been standardised, this knowledge is usually encapsulated in the form of an analytical protocol. With in silico experiments as well, analytical protocols make it possible for experiments to be replicated and shared, and (via meta-information) for the knowledge behind these workflows to be captured. These protocols should be reproducible, ontology-driven, internally accurate, and annotated externally.

Systems such as W2H/W3H [2] and PISE [3] provide some tools that allow methods to be combined. W3H is a task framework that allows the methods available under W2H [4] to be integrated; however, those tasks have to be hard-coded. In the case of PISE, the user can either define a macro using Bioperl http://www.bioperl.org, or use the interface provided and register the resulting macro. In either case, it is assumed that the user can program, or at least script in Perl. Macros cannot be exchanged between PISE and W2H, although these two systems provide GUIs for more or less the same set of methods (EMBOSS [5]). Indeed, macros
cannot be easily shared even among PISE users. Biopipe http://www.biopipe.org, on the other hand, provides integration for some analytical tools via the Bioperl API (Application Programming Interface), using MySQL to store both results and the workflow definition; in this way, users are able to store results in MySQL and monitor the execution of the predefined workflow. The TAVERNA http://taverna.sourceforge.net project provides capabilities similar to those offered by G-PIPE. However, on one hand, inclusion of new analytical methods is not currently possible since no GUI generator is provided; on the other hand, as TAVERNA is part of myGrid [6], it follows a different integrative approach (Web Services). Pegasys [7] is a similar approach, going beyond analytical requirements and providing database capacities. G-PIPE provides a real capacity for users to define and share complete analytical workflows (methods, parameters, and meta-information), substantially mitigating the syntactic complexity that this process involves. Our approach addresses overall collaborative issues as well as the physical integration of tools. Unlike TAVERNA, G-PIPE provides an implementation that builds on a flexible syntactic structure and a set of algebraic operations for analytical workflows. The definition of operators as part of the workflow description allows a flexible set-up at execution time; operators also facilitate the reproducibility of the workflow, as they allow researchers to share experimental conditions in the form of parameters.

Although G-PIPE was not conceived as an environment for testing usability aspects in the design of bioinformatics tools, empirical observations allowed us to see that the disposition of the functional objects in the interface (e.g. interfaces to algorithms and the workflow representation) was simpler and easier for researchers than the one provided by TAVERNA. An important issue raised by these observations was the high level of complexity involved in parameterisation, as researchers usually run algorithms with default settings. Unlike G-PIPE, TAVERNA assumes users have an understanding of web services: part of the necessary steps when defining a workflow in TAVERNA involves the selection of the algorithm as a web service. Another interesting aspect we could observe was the
importance of having a tool in which fewer steps are involved in the definition and execution of the workflow. TAVERNA requires too many details and involves too many steps when defining and executing a workflow; some of the required information is technical and thus more related to the operational domain than to the domain of knowledge, which places an unnecessary burden on the researcher. Surprisingly, there are no usability methods for bioinformatics, nor are usability studies performed throughout the software development process in bioinformatics; the application of usability engineering could potentially benefit the development of bioinformatics tools by bringing them closer to the needs of end-users. The facility provided by G-PIPE for the generation of the workflow aims to hide the complexity of the workflow by allowing researchers to concentrate on the minimal necessary procedural details (e.g. input files, parameters, where to pipe).

For testing purposes we provide a simple example of a workflow (inference of a phylogeny of rodents) that involves piping among three methods. Although here their execution takes place on a common server, it is equally possible to distribute the process over a grid using G-PIPE. The definition of the workflow as well as the corresponding input files are available in Appendix 5.
7.2 RESULTS
Our workflow follows a task-flow model; in bioinformatics, tasks can be understood as analytical methods. If workflow models are represented as a directed acyclic graph (DAG), analytical methods appear as nodes, and state information is represented as conditions attached to the edges. Our syntactic structure and algebraic operators can be used to represent a large number of analytical workflows in bioinformatics; surprisingly, no other algebraic operators capable of symbolising the different operations required for analytical workflows have been reported in the bioinformatics literature (or, indeed, more broadly in e-science, although such operators are widely used in the analysis of business processes). Different groups have
developed a great diversity of GUIs for EMBOSS and GCG, but a meta-analysis of the processes within which these analytical implementations are immersed is not yet fully available. Some of the existing GUIs have been developed to make use of grammatical descriptions of the analytical methods, but there exists no standard metadata framework for GUI and workflow representation in bioinformatics.
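As a minimal sketch of the DAG view introduced above (the data structure and the condition labels are invented for illustration, not taken from G-PIPE), methods can be held as nodes in a hash of adjacency lists, with the state information attached to each edge:

use strict;
use warnings;

my %dag = (
    clustalw => [ { next => 'protpars', condition => 'alignment produced' } ],
    protpars => [ { next => 'consense', condition => 'trees produced' } ],
    consense => [],   # terminal node: no outgoing edges
);

# Walk the graph from the starting method, printing each conditional edge.
my @queue = ('clustalw');
while (my $node = shift @queue) {
    for my $edge (@{ $dag{$node} }) {
        print "$node -> $edge->{next} [if: $edge->{condition}]\n";
        push @queue, $edge->{next};
    }
}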
Chapter 7 - Figure 1. Syntactic components describing bioinformatics analysis workflows.
7.2.1 Syntactic and algebraic components

Our workflow conceptualisation (Figure 1) closely follows those of Lei and Singh [8] and Stevens et al. [9]. We have adapted these meta-models to processes in bioinformatics analysis. We consider an input/output data object to be a collection of input/output data. For us, a transformer is the atomic work item in a workflow; in analytical workflows, it is an implementation of an analytical algorithm (analytical method). A pipe component is the entity that contains the required input-output relation (e.g. information about the previous and subsequent tasks); it assures syntactic coherence. Our workflow representation has tasks, stages, and experimental conditions (parameters). In our view, protocols are sets of information that describe an experiment. A protocol contains workflows, annotations, and information about the raw data; we therefore understand a workflow to be a group of stages with interdependencies. It is a process bound to a particular resource that fulfils the analytical necessities.
We identify needs common to analytical workflows in bioinformatics:
• Flexibility in structuring and modelling (open-ended, sometimes ad hoc workflow definition, allowing decision-making whilst a workflow is being executed).
• Support for workflows with a complex (or nested) inner structure of individual steps (such that multi-level modelling becomes appropriate). Biological workflows may be complex not simply because of the discrete number of steps, but due to the highly nested structure of iteration, recursion and conditional statements that, moreover, may involve interaction with non-workflow systems.
• Distribution of workflow execution over grid environments.
• Management of failures. This particular requirement is related to conditional statements: where the service will be executed should be evaluated based on considerations of availability and efficiency made prior to the execution of the workflow. In situations where a failure halts the process, the system should either recover it, or dispatch it somewhere else without requiring intervention by the user.
• System functionality features such as browsing and visualisation, documentation, or coupling with external tools, e.g. for analysis.
• A semantic layer for collaborative purposes. This semantic layer has many other features, and may be the foundation for intelligent agents that facilitate collaborative research.
Chapter 7 - Figure 2. Syntactic components and algebraic operators.
OPERATOR: ITERATION (I)                I[TRANSFORMER](CC1, CC2, ..., CCn)
OPERATOR: RECURSION (R)                R[TRANSFORMER: PARAMETER](PARM_SPACE)
OPERATOR: CONDITION (C)                C[FUNCTIONAL CONDITION (TRUE: PATH; FALSE: PATH; VALUE: PATH)]
OPERATOR: SUSPENSION/RESUMPTION (S)    S[RE-TAKE JOBS: EXECUTION]

Chapter 7 - Table 1. Algebraic operators
Executing these bioinformatics workflows further requires:
• Support for long-running activities with or without user interaction.
• Application-dependent correctness criteria for execution of individual and concurrent workflows.
• Integration with other systems (e.g. file managers, database management systems, product data managers) that have their own execution/correctness requirements.
• Reliability and recoverability with respect to data.
• Reliable communication between workflow components and processing entities.

Among these requirements, we focus our analysis only on those closely related to workflow design issues, more specifically (a) the piping of data, (b) the availability of conditional statements, (c) the need to iterate one method over a set of different inputs, (d) the possibility of recursion over a parameter space for a method, and (e) the need for stop/break management. Algebraic operators can accurately capture the meaning of these functional requirements. To describe an analytical workflow, it is necessary to consider both algebraic operators and syntactic components. In Table 2 we present the definitions of the algebraic operators we propose, and in Figure 2 we illustrate how these operators and syntactic elements together can describe an analytical workflow.

Iteration is the operator that enables processes in which one transformer is applied over a multiple set of inputs. A special case for this operator occurs when it is applied over a blank transformer; this case results in replicates of the input collection. Consider an analytical method, or a workflow, in which the same input is to be used several times; the first step would be to use as many replicates of the input collection as needed. The recursion operation takes place when one transformer is applied with parameters defined not as a single value, but as a range or a set of values. The conditional operator has to do with the conditioned execution of transformers. This operation can be attached to a function evaluated over the
application of a recursion or of an iteration; if the stated condition is true, the workflow executes a certain path. Conditional statements may also be applied to cases where an argument is evaluated on the input; the result affects not a path, but the parameter space of the next stage. The suspension/resumption operation stands for the capacity of the workflow to stop and re-capture jobs.

Formal Concept Analysis (FCA) is a mathematical theory based on ordered sets and complete lattices. Numerous investigations have shown the usefulness of concept lattices for information retrieval combining query and navigation, learning and data-mining, visual constructors and visual programming [10]. FCA helps one to define valid objects, and to identify behaviours for them. We are currently working on a complete FCA for biological data types and operations (database and analytical). Here we define operators in terms of pre- and post-conditions, as a step toward eventual logical formalisation.

We focus on those components of the discovery process not directly related to database operations; a good integration system will "hide" the underlying heterogeneity, so that one can query using a simple language (which views all data as if they were already in the same memory space). Selection of the query language depends only on the data model. For the XML "data model", XML-QL, XQL and other XML query languages are available; for the nested relational model there are nested relational calculi and nested relational algebras; for the relational model, SQL and relational algebras are available. For database operations the issues that arise are lower-level (e.g. expression of disk layout, latency cost, etc. in the context of query optimisation), and it is not clear that any particular algebra offers a significant advantage.

A more detailed example involves the inference of molecular phylogenetic trees by executing software that implements three main phylogenetic inference methods: distance, parsimony and maximum likelihood. Figure 3 illustrates how our algebraic operators and syntactic components define the structure of this workflow.
Operator: Iteration (I): I[Transformer, (CC1, CC2, ..., CCn)]: (CC1', CC2', ..., CCn')
Pre-condition: T = Transformer, T ≠ blank; C = {CC1, CC2, ..., CCn} such that CCi ∈ {Biological data types}
Post-condition: C' = {CC1', CC2', ..., CCn'} such that CCi' = T(CCi), 1 ≤ i ≤ n

Operator: Iteration over a blank transformer (I): I[blank: num, (CC1, CC2, ..., CCn)]: (CC1, CC2, ..., CCn)1, (CC1, CC2, ..., CCn)2, ..., (CC1, CC2, ..., CCn)num
Pre-condition: num ∈ ℕ, num = number of replicates; C = {CC1, CC2, ..., CCn} such that CCi ∈ {Biological data types}
Post-condition: num replicates of the input collection C are produced

Operator: Recursion (R): R[Transformer: Parameter, (Parm_Space)]: Parm_Space'
Pre-condition: P = Parameter such that P ∈ Parm_Space (Parm_Space = {Parm_Values}); T = Transformer
Post-condition: Parm_Space' = T(Parm_Space)

Operator: Condition (C): C[Functional_Condition]: PATH
Pre-condition: FC = Functional_Condition
Post-condition: PATH = the path associated with the value (true or false) to which FC evaluates

Operator: Suspension/Resumption (S): S[re-take, jobs]: Execution
Pre-condition: (re-take = true) ∨ (re-take = false ∧ jobs = set of jobs to be suspended)
Post-condition: (re-take = true ∧ ((Execution = true ∧ previously suspended jobs are re-taken) ∨ (Execution = true ∧ there were no suspended jobs))) ∨ (re-take = false ∧ (Execution = true ∧ ∀j such that j ∈ jobs, j is suspended))

Chapter 7 - Table 2. Operator specifications.
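To make these specifications concrete, the following minimal sketch (an illustration written for this chapter, not G-PIPE source code) expresses the iteration, recursion and condition operators as Perl higher-order functions; a transformer is modelled as a code reference, and all data and parameter values are invented for the example:

use strict;
use warnings;

# I[Transformer](CC1, ..., CCn): apply one transformer over many input collections.
sub iterate {
    my ($transformer, @collections) = @_;
    return map { $transformer->($_) } @collections;
}

# I[blank: num](C): the blank-transformer special case replicates the input collection.
sub replicate {
    my ($num, $collection) = @_;
    return ($collection) x $num;
}

# R[Transformer: Parameter](Parm_Space): apply one transformer to one input,
# once per value in the parameter space.
sub recurse_over_parameters {
    my ($transformer, $input, @parm_space) = @_;
    return map { $transformer->($input, $_) } @parm_space;
}

# C[Functional_Condition]: choose the downstream path from a predicate.
sub branch {
    my ($condition, $true_path, $false_path, @input) = @_;
    return $condition->(@input) ? $true_path->(@input) : $false_path->(@input);
}

# Example: iterate an (invented) formatting step over two collections,
# replicate an input, re-run one collection across a range of gap penalties,
# and branch on a simple condition.
my $format    = sub { uc $_[0] };
my @formatted = iterate($format, 'acgt', 'ttga');                   # ACGT, TTGA
my @copies    = replicate(2, 'acgt');                               # acgt, acgt
my $align     = sub { my ($seq, $gap) = @_; "$seq aligned (gap=$gap)" };
my @runs      = recurse_over_parameters($align, 'ACGT', 8, 10, 12);
my $path      = branch(sub { length($_[0]) > 6 },
                       sub { "long: $_[0]" }, sub { "short: $_[0]" }, 'ACGT');
print "$_\n" for @formatted, @copies, @runs, $path;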
Chapter 7 - Figure 3. Phylogenetic analysis workflow
In collaboration with CIAT (International Center for Tropical Agriculture, Cali, Colombia) we have implemented an annotation workflow using standard technology (G-PIPE/PISE) and web services (TAVERNA). Our case workflow is detailed in Figure 4. Implementation of both of these workflows was a manual process. GUI generation was facilitated by using PISE as our GUI generator, and this simplified the inclusion of new analytical methods as needed. Database calls had to be manually coded in both cases. Choreographing the execution of the workflow was not simple, as neither system has a real workflow engine. It proved easier to give users the ability to manipulate parameters and data with PISE/G-PIPE, partly because of the wider range of methods within Bioperl, and partly because algebraic operators were readily available as part of PISE/G-PIPE. From this experience we have concluded that, due to the immaturity of currently available web service engines, it is still most practical to implement simple XML workflows that allow users to manipulate parameters, use conditional operators, and carry out write and read operations over databases. This balance
will, of course, presumably shift as web services mature in the bioinformatics applications domain.
Chapter 7 - Figure 4. Case workflow
7.2.2 Workflow generation: an implementation

We have developed G-PIPE, a flexible workflow generator for PISE. G-PIPE extends the capabilities of PISE to allow the creation and sharing of customised, reusable and shareable analytical workflows. So far we have implemented and tested G-PIPE over only the EMBOSS package, although extension to other algorithmic implementations is possible wherever there is an XML file describing the command-line user interface.

Workflows automate business procedures in which information or tasks are passed between conforming entities according to a defined set of rules; some of these business rules are defined by the user, and in our implementation are managed via G-PIPE. For our purposes, the conforming entities are analytical methods (Clustal, Protpars, etc.). Syntactic
rules drive the interaction between these entities (e.g. to ensure syntactic coherence between heterogeneous file formats). G-PIPE also assures the execution of the workflow, and makes it possible to distribute different jobs over a grid of servers. G-PIPE addresses these requirements mostly using Bioperl. In G-PIPE, each analysis protocol (including any annotations, i.e. meta-data) is defined within an XML file. A Java applet provides the user with an exploratory tool for browsing and displaying methods and protocols. Synchronisation is maintained between client-side display and server-side storage using Javascript. Server-side persistence is maintained through serialised Perl objects that manage the workflow execution. G-PIPE supports independent branched tasks in parallel, and reports errors and results into an HTML file. The user selects the methods, sets parameters, defines the chaining of different methods, and selects the server(s) on which these will be executed. G-PIPE creates an XML file and a Perl script, each of which describes the experiment. The Perl file may later be used on a command-line basis, and customised to address specific needs. The user can monitor the status of workflow execution, and access intermediary results. A workflow built with G-PIPE can distribute its analyses onto different, geographically dispersed G-PIPE/PISE servers.
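The sketch below suggests the general shape such an XML protocol can take; the element and attribute names are invented for illustration only (the actual G-PIPE definition used for the rodent example is given in Appendix 5):

<workflow name="rodent_phylogeny">
  <stage id="1" method="clustalw">
    <input collection="rodent_sequences.fasta"/>
    <parameter name="output" value="phylip"/>
  </stage>
  <stage id="2" method="protpars">
    <pipe from="1"/>  <!-- pipe component: consumes the output of stage 1 -->
  </stage>
  <annotation>Parsimony phylogeny of rodent sequences</annotation>
</workflow>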
7.3 ARCHITECTURAL DETAILS
The overall architecture of G-PIPE is shown in Figure 5. A Java applet provides the user with an exploratory tool for browsing and displaying methods and protocols. The user interacts with the HTML forms to define a protocol. Synchronisation is maintained between client-side display and server-side storage using Javascript. Server-side persistence is maintained through serialised Perl objects that describe the experiment. The object is translated into two user-accessible files: an XML file to share and reload protocols, and a Perl script. A new lightweight PISE/Bioperl module, PiseWorkflow, lets workflows be built and run atop PiseApplication instances. This module supports independent branched tasks in parallel, and reports errors and results into an HTML file.
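A minimal sketch of this persistence mechanism, assuming the core Storable module; the hash layout and file name are invented (G-PIPE's actual serialised objects are more elaborate):

use strict;
use warnings;
use Storable qw(store retrieve);

my $experiment = {
    protocol => 'rodent_phylogeny',
    stages   => [ { method => 'clustalw' }, { method => 'protpars' } ],
    status   => 'running',
};

store($experiment, 'gpipe_session.stor');        # persist between requests
my $restored = retrieve('gpipe_session.stor');   # reload on the next request
print "Resumed protocol: $restored->{protocol}\n";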
Chapter 7 - Figure 5. G-PIPE Architecture.
7.4 SEMANTIC AND SYNTACTIC ISSUES
The representation of the workflow, as illustrated in Appendix 5, is grounded in both semantic and syntactic elements. The semantics encoded within the proposed XML makes it easier for developers to understand the meaning of each element in the XML file. This chapter proposes a set of valid workflow constructs for bioinformatics. These are technically valid not only because they have been identified as common to many bioinformatics workflows, but also because they are structured in an XML file with well-encoded semantics. This facilitates the incorporation of the constructs into larger efforts such as the RSBI (see Chapter 5, Section 5.2). Within the context of a biological investigation, for which bioinformatics might facilitate the design of SNP (Single Nucleotide Polymorphism) primers, there is an associated
computational workflow, as illustrated in Figure 6. In this case, syntactic elements have been defined by a clear semantics that allows developers to manipulate the constructs depending on the needs of the application; there is thus a semantic scaffold within which the syntactic aspects make sense. For G-PIPE it is enough to allow users to manipulate parameters, transformers, pipe components, and data collections. However, when annotating complete biological investigations, the design of SNPs, or any computational method involved, is just a small part of a larger effort. In these cases the annotation is not only about the identified constructs; the workflow is part of the whole, and the workflow constructs therefore have to be annotated within the new context.
Chapter 7 - Figure 6. Designing SNPs.
The RSBI ontology, in principle, allows this integration. The collection component as understood by Garcia et al. [1], and discussed in Section 7.2, can be assimilated to the concept of Biomaterial. The transformer can be assimilated to the assay. Figure 7 illustrates how, for a particular segment of the workflow presented in Figure 6, the RSBI ontology together with the workflow constructs represents the use of TBLAST in a meaningful way. It is important to notice that the larger the effort, the more complex the annotation. Biomaterial is an elusive concept: for every assay, be it computational, in vivo, or in vitro, there is the potential to fragment or even mutate (transform) the biomaterial; however, there is always the need to trace the sample back to its original source, allowing researchers to inspect the process at different levels of detail.
Chapter 7 - Figure 7. Mapping the RSBI
7.5 DISCUSSION
The syntactic and algebraic components we introduce above make it possible to describe analytical workflows in bioinformatics precisely yet flexibly. Detailed algebraic representations of these kinds of processes have not previously been used in this domain, although they are commonly used to represent business processes. Since open projects such as Bioperl or Biopipe contain the rules and logic for bioinformatics tasks, we believe that having an algebraic representation could contribute importantly to the development of a biological "language" that allows developers to avoid the tedious parsing of data and analytical methods so common in bioinformatics.

The schematic representation for workflows in bioinformatics that we present here could evolve to cover other tool-driven integrative approaches such as those based on web services. Workflows in which concrete executions take place over a grid of web services involve basically the same syntactic structure and algebraic operators; however, a clear business logic needs to be defined beforehand for those web services in order to deepen the integration beyond the simple fact of remote execution. A higher level of sophistication for the pipe component as well as for the conditional operator may be needed, since remote execution requires (for example) assessment of the availability of the service for the job to be successfully dispatched and processed. For our implementation we use two agents, one on the client side and the other on the server side, with the queue handled by PBS (Portable Batch System). It is possible to add a semantic layer, thereby allowing conceptual selection of the transformers; a clear separation between the operational domain and the knowledge domain would then be achieved naturally.

Semantic issues are particularly important with these kinds of workflows. An example may be derived from Figure 3, where three different phylogenetic analysis workflows are executed. These may be grouped as equivalent, but are syntactically different. Selection should be left in the hands of the user, but the system should at least inform the user about this similarity.
Despite agreement on the importance of semantic layers for integrative systems, such a level of sophistication is far from being achieved. Lack of awareness of the practical applications of such technologies is well illustrated by a traditional and well-studied product: Microsoft Word®. With Word, syntactic verification can take place as the user composes text, but no semantic corroboration is done. For two words like "purpose" and "propose", Word advises on syntactic issues, but gives no guidance concerning the context of the words. Semantic issues in bioinformatics workflows are more complex, and it is not clear whether existing technologies can effectively overcome these problems. Transformers and grid components are intrinsically related because the services are de facto linked to a grid component. It has been demonstrated that the use of ontologies facilitates interoperability and the deployment of software agents [11]; correspondingly, we envision semantic technology supporting such agents to form the foundation of future workflow systems in bioinformatics. The semantic layer should make the agents more aware of the information.

More and more GUIs are available in bioinformatics; this can be seen in the number of GUIs for EMBOSS and GCG alone. Some of them incorporate a degree of workflow capability, more typically a simple chaining of analytical methods rather than flexible workflow operations. A unified metadata model for GUI generation is lacking in the bioinformatics domain. Web services are relatively easy to implement, and are becoming increasingly available as GUI systems are published as web services. However, web services were initially developed to support processes for which the business logic is widely agreed upon, well defined and properly structured, and the extension of this paradigm to bioinformatics may not be straightforward.

Automatic service discovery is an intrinsic feature of web services. The accuracy of the discovery process necessarily depends on the ontology supporting this service. Systems such as BioMoby and TAVERNA make extensive use of service discovery; however, due to the difficulty of describing biological data types, service discovery is not yet accurate. It is not yet clear whether languages such as OWL can be developed to describe relations between
biological concepts with the required accuracy. Integrating information is as much a syntactic as a semantic problem, and in bioinformatics these boundaries are particularly ill defined.

Semantic and syntactic problems were also identified from the case workflow described in Figure 3. There, we saw that to support the extraction of meaningful information and its presentation to the user, formats should be ontology-based and machine-readable, e.g. in XML format. The lack of these functional features makes manipulation of the output a difficult task that is usually addressed by the use of parsers specific to each individual case. For workflow development, human readability can be just as important. Consider, for example, a ClustalW output where valid elements could be identified by the machine and presented to the user together with contextual menus offering different options over the different data types. In this way the user would be able to decide what to do next, where to split a workflow, and over which part of the output to continue or extend the analysis. Inclusion of this functionality would allow the workflow to become more concretely defined as it is used.

Failure management is an area in which we can see a clear difference between the business world and bioinformatics. In the former, processes rarely take longer than an hour and are not so computationally intensive, whereas in bioinformatics, processes tend to be computationally intensive and may take weeks or months to complete. How failures can be managed to minimise losses will clearly differ between the two domains. Due to the immaturity of both web services and workflows in bioinformatics, it is still in most cases more practical to hard-code analytical processes. Improved failure management is one of the domain-specific challenges that face the application of workflows in bioinformatics.
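Returning to the parsing point above, the following minimal Bioperl sketch (the alignment file name is invented) shows the kind of format-specific parser that machine-readable, ontology-based outputs would make unnecessary:

use strict;
use warnings;
use Bio::AlignIO;

# Read a ClustalW alignment with the parser dedicated to that one format.
my $in = Bio::AlignIO->new(-file => 'rodents.aln', -format => 'clustalw');
while (my $aln = $in->next_aln) {
    print 'alignment length: ', $aln->length, "\n";
    print 'sequence: ', $_->display_id, "\n" for $aln->each_seq;
}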
Chapter 7 - Figure 8. G-PIPE. http://if-web1.imb.uq.edu.au/Pise/5.a/gpipe.html http://gene3.ciat.cgiar.org/Pise/5.a/gpipe.html
So far we have intentionally referred to GUIs and workflows as more-or-less independent. A glimpse into the corresponding metadata reveals that GUIs are themselves components of workflow systems. In the bioinformatics domain this relationship is particularly attractive, since algebraic operations are usually highly nested. The interface system should therefore provide a programming environment for non-programmers, as sketched below. The language as such is not complex, but makes extensive use of statements such as while...do, if...then...else, and for...each. The representation should be natural to the researcher, separating the knowledge domain from the operational domain.
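A short sketch of the kind of nested control flow such an environment must expose; run_method and quality_ok are hypothetical stand-ins for real method dispatch and output checking:

use strict;
use warnings;

my @datasets = ('rodents.fasta', 'primates.fasta');      # invented inputs

foreach my $dataset (@datasets) {                        # for ... each
    my $alignment = run_method('clustalw', $dataset);
    if (quality_ok($alignment)) {                        # if ... then ... else
        run_method('protpars', $alignment);
    }
    else {
        run_method('clustalw', $dataset, gapopen => 12); # retry with a new parameter
    }
}

# Hypothetical stand-ins so the sketch runs as-is.
sub run_method { my ($method, @args) = @_; return "$method(@args)"; }
sub quality_ok { return length($_[0]) > 0; }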
7.6 CONCLUSION
We have developed G-PIPE, a flexible workflow generator that makes it possible to export workflow definitions either as XML or as Perl files (which can later be handled via the Bioperl API). Our XML workflow representation is reusable; execution and editing of the generated workflows are possible either via the Bioperl API or the provided GUI. Each analysis is configurable, as users are presented with options to manipulate all available parameters supported by the underlying algorithms. Integration of new algorithms, and grid execution of workflows, are also possible. Most available integrative environments rely on parsers or syntactic objects, making it difficult to integrate new analytical methods into workflow systems. We are planning to develop a more wide-ranging algebra that includes query operations over biological databases, as well as different ontological layers that facilitate data interoperability and the integration of information where possible for the user. We do not envision G-PIPE as a complete virtual laboratory environment; future releases will provide a content management system for bioinformatics with workbench capacities developed on top of ZOPE http://www.zope.org. We have tested our implementation over SUSE and Debian Linux, and over Solaris 8.
7.7 ACKNOWLEDGEMENTS
We gratefully acknowledge the collaboration of Dr Fernando Rodrigues (CIAT) in developing the case study outlined in Figure 4, and Dr Lindsay Hood (IMB) for valuable discussions. ST thanks Université Montpellier II for travel support. This work was supported by ARC grants DP0344488 and CE0348221.
7.8 REFERENCES

1. Hollingsworth D: The workflow reference model. [http://www.wfmc.org/standards/docs/tc003v11.pdf].
2. Ernst P, Glatting K-H, Suhai S: A task framework for the web interface W2H. Bioinformatics 2003, 19:278-282.
3. Letondal C: A Web interface generator for molecular biology programs in Unix. Bioinformatics 2001, 17:73-82.
4. Senger M, Flores T, Glatting K-H, Ernst P, Hotz-Wagenblatt A, Suhai S: W2H: WWW interface to the GCG sequence analysis package. Bioinformatics 1998, 14:452-457.
5. Rice P, Longden I, Bleasby A: EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet 2000, 16:276-277.
6. Stevens R, Robinson AJ, Goble C: myGrid: personalised bioinformatics on the information grid. Bioinformatics 2003, 19:i302-i304.
7. Shah SP, He DYM, Sawkins JN, Druce JC, Quon G, Lett D, Zheng GXY, Xu T, Ouellette BFF: Pegasys: software for executing and integrating analyses of biological sequences. BMC Bioinformatics 2004, 5:40.
8. Lei K, Singh M: A comparison of workflow meta-models. In: Workshop on Behavioural Modelling and Design Transformations: Issues and Opportunities in Conceptual Modelling (ER'97), Los Angeles, 6-7 November 1997.
9. Stevens R, Goble C, Baker P, Brass A: A classification of tasks in bioinformatics. Bioinformatics 2001, 17:180-188.
10. Ganter B, Kuznetsov SO: Formalizing hypotheses with concepts. In: Conceptual Structures: Logical, Linguistic, and Computational Issues. 8th International Conference on Conceptual Structures (ICCS 2000), Darmstadt, Germany, 14-18 August 2000. Lecture Notes in Computer Science 1867. Edited by Mineau G, Ganter B. Springer-Verlag; 2000:342-356.
11. Sowa JF: Top-level ontological categories. International Journal of Human-Computer Studies 1995, 43:669-685.
Chapter VIII - Conclusions and Discussion

8.1 SUMMARY
This thesis has dealt primarily with how to actually build ontologies, and how to gain the involvement of the community. This research presents a detailed description of three different but complementary ontology developments, along with the issues surrounding the development of the corresponding ontologies. As ontologies are living systems, constantly evolving, the maintenance and life cycle of the ontology have also been investigated in order to arrive at a consistent methodology. It has been largely accepted by the biological community that ontologies play a prominent role when integrating information; however, very few studies have focused on the relationship between the syntactic structure and the semantic scaffold. This thesis has also explored this relationship.

The introductory chapters have investigated methodological aspects of building ontologies, ranging from the role of the domain expert to the similarity between the biological and the Semantic Web scenarios. Within this context the role of concept maps during knowledge elicitation when building conceptual models has been established. Other aspects related to the use of concept maps have also been reported (e.g. argumentative structure, maintenance). As the integration of information has different facets, the present work has also covered the workflow nature of bioinformatics. Within this context a syntactic structure was proposed in order to allow in silico experiments to be replicable and reproducible. More importantly, from this experience it was possible to study the relationship between syntax and semantics. This research is based upon real cases in which researchers were involved; this allowed the author to benefit from a direct relationship not only with the subject of study but also with the context in which solutions were expected to play a role.
The discussion and conclusions are organised as follows: first, a summary of the thesis is presented. In Sections 8.2 and 8.3 the similarity between the Semantic Web scenario and that of biology is illustrated within the context of information systems. Issues related to the construction of biological ontologies are discussed in Section 8.4 and, finally, references are given in Section 8.5.
8.2 BIOLOGICAL INFORMATION SYSTEMS AND ONTOLOGIES
In this investigation we have argued that the integration of information in molecular bioscience (and, by extension, in other technical fields) is a deeper issue than access to a particular type of data (sequences, structures) or record (GenBank® accession number). Integration of information in bioinformatics has to support research endeavours in such a way that it facilitates the formulation and testing of biological hypotheses. For instance, within a bio-fortification project a biological hypothesis may state that "genes from the Yellow Stripe Like (YSL) family may be used for the fortification of rice grains as they are responsible for the uptake and long-distance transportation of iron-chelates in rice". In this context not only information about genes, proteins and metabolic pathways is needed; researchers also need to correlate all the information they have and can access through the Internet. In this way a relationship between YSL genes and the concentration of iron in rice grains may be found, and consequently tested in a laboratory.

The connection between Laboratory Information Management Systems (LIMS) and external information is thus critical. LIMS are a special kind of biological information system, as they in principle organise the information produced by laboratories. Once this information has been organised the analysis process takes place, and discovering relations becomes more and more important. Within the plant context, plant-related descriptors such as those provided by the Plant Ontology (PO) [1] and Gramene [2] are being consumed by object models in a variety of software systems, such as LIMS, in order to support the annotation of sequences and experiments. These object models
are meant to support an integrative approach, and the use of orthogonal ontologies is therefore essential. Relating phenotypic information to its corresponding genotypes, and vice versa, should in principle be possible. For instance, with a saline-stress-related query one should be able to retrieve not only sequences but experimental designs and conditions, location, morphological features of the plants involved, etc. Common biological descriptors should be identified, and ontologies addressing specific needs of the plant community need to be developed. Molecular information may be described independently from the domain, as for instance in the Gene Ontology (GO) [3]; phenotypic information, however, is highly specific to the type of organism being described.

Ideally LIMSs should consume core and domain-specific terminology in order to allow for the annotation of experiments; these vocabularies should be shared across the community so that exchanging information becomes a simpler task. In order for information to be shared, the vocabulary used should be independent of the LIMS; different LIMS should be able to share a standard vocabulary. This ensures the independence of the conceptual model from the functional model – researchers may use different LIMS but still name things with a consistent vocabulary. In the same vein, this may allow experiments to be shared in the form of customisable "templates".

Accurate annotation of experimental processes and their results with well-structured ontologies allows for semantic integration and querying across disparate data sets and data types. This sort of large-scale data integration can be achieved by the use of a data integration engine based on graph theory. Furthermore, reasoning engines can be constructed to perform automated reasoning over the data annotated with these types of ontologies. The result will be a better understanding of the meaning of the results of a wide variety of experiments, and an increased ability to develop further hypotheses in silico [14].

Some attempts have been made to define what an investigation is, what the difference between a test and an assay is, how we can classify experiments, and how to annotate research endeavours in order to facilitate contextualised information retrieval. One of the first
ontologies addressing the problem of describing experiments was the MGED Ontology (henceforth MO) [4]; it was developed as a collaborative effort by members of the MGED Ontology working group in order to provide the descriptors required to interpret microarray experiments. Although these concepts were derived from the MicroArray and Gene Expression Object Model (MAGE-OM), a framework for representing gene expression data and relevant annotations [5], in principle any software capable of consuming the ontology can use these descriptors. There is thus a separation between the functional and the declarative (ontological) model.

Throughout this thesis, the need to support integrative approaches with rich and useful graphical environments has been clearly stated. Designing these environments is a research topic not sufficiently studied within the context of bioinformatics. There have been very few Human-Computer Interaction (HCI) evaluations of bioinformatics tools; moreover, HCI and cognitive aspects are rarely considered when designing biological information systems. Information foraging, which refers to activities associated with assessing, seeking, and handling information sources [6], has also not been considered in bioinformatics. Such search is adaptive to the extent that it makes optimal use of knowledge about the expected value of the information, and the expected costs of accessing and extracting it. Humans seeking information adopt different strategies when gathering information and extracting knowledge from the results of their searches. A relationship between the user and the information is then built. The relation is easy if the data are presented to the user in a clear way, if the information provides extraction tools, and especially if the information is understandable, structured, and immersed in the right context. Value and relevance are not intrinsic properties of information-bearing representations, but can be assessed only in relation to the environment in which the task is embedded.

Graphical User Interfaces (GUIs) should facilitate managing and accessing the information. Graphical environments should relieve users from high learning curves and the difficulties of accessing command-line-based interfaces. There is a need to establish a clear separation between the operational domain and the domain of knowledge. For a researcher,
finding a protein defined by some specific features, along with all the relevant bibliographic references, should not be a daunting task. Integrative approaches should therefore be integral. Ontologies may help in creating coherent visual environments, as has already been shown by Stevens et al. with the TAMBIS project [7].
8.3 TOWARDS A SEMANTIC WEB IN BIOLOGY
The field of bio-ontology development has been surprisingly active in recent years, partly because of the premise that it will encourage and enable knowledge sharing and reuse, but also because the biological community is gradually adopting a holistic approach for which context is critical – a paradigm shift, some would say. In order to achieve this "holistic view", it is indispensable to develop ontologies that accurately describe the reality of the world. Different groups will develop this ontological corpus – as is currently happening. Those efforts already in place are independent from one another, and made mostly in response to ad hoc necessities. Ironically, the biological community may be re-writing an old story: database integration in molecular biology has long been a problem, partly because most approaches to data integration have been driven by necessity. By the same token, biological ontologies have been developed as momentary responses to particular needs. This has led the bio-communities to describe their worlds from their particular perspectives, not taking into account that at a later stage these ontologies are needed to describe the "big picture". This approach also carries negative implications for the maintenance and evolution of the ontologies. This situation should change within the coming years, not only because of the lessons learned, but also because the "big picture" will drive biology more and more, making it necessary to have articulated descriptions based on well-harmonised ontologies.

More importantly, ontologies are being, slowly but firmly, separated from object models. This independence should allow ontologies to be used across a wide range of applications. For instance, any Laboratory Information Management System should be able to use the same
descriptors for those processes for which it was designed, thereby enabling data sharing and, to some extent, knowledge sharing. Full experiments could then be easily replicated. Ontologies should be independent of computer realisations.

Are we heading towards a Semantic Web (SW) in bioinformatics? It was Tim Berners-Lee who initially presented the vision of a unified data source where, as a consequence of highly integrated systems, complex queries could be formulated [8]. Much time has passed since this vision was presented, and many different approaches have been developed to make it operative; it is still hard to define what the Semantic Web really means. The SW may be seen as a knowledge base where semantic layers allow reasoning and the discovery of hidden relations, contextualising the information and thereby delivering personalised services. In the development of the Semantic Web there is thus a pivotal role for ontologies to play, since they provide a representation of a shared conceptualisation of a particular domain that can be communicated between people and applications.

In the particular field of bioinformatics, interoperability and the integration of information have been at issue since some of the first databanks became publicly accessible. Most previous attempts at database integration have addressed the problem of querying, and extracting data from, multiple heterogeneous data sources from the syntactic perspective. If the process can be done via a single query, the data sources involved are considered interoperable and integrated. These approaches do not consider how a particular biological entity might be meaningfully related to others, how it is immersed in different processes, or how it is related to relevant literature sources within the context of a given research question; only location and accessibility have been at issue. In the same way, the complexity of a query would be largely a function of how many different databases must be queried, and from these how many internal sub-queries must be formed and exchanged for the desired information to be extracted. If a deeper layer embedding semantic awareness were added, query capacities would probably be improved. This can be envisioned as the provision, within an ontological layer, of just enough connective tissue to allow semi-intelligent agents or search engines to execute simplified
queries against hundreds of sites (McEntire, 2002). Not only could different data sources be queried, but also (more importantly) interoperability would then arise naturally as a consequence of semantic awareness. At the same time, it should be possible to automatically identify and map the various entries that constitute the knowledge relationship, empowering the user to visualise a more descriptive landscape.

What is semantic integration of molecular biology databases? What does it mean to have a Semantic Web for the biological domain? Not so surprisingly, as has already been mentioned and is discussed in more detail in Chapter 2, the biological community is heading towards a Semantic Web, since it has long faced the problems the syntactic web has always had. The Semantic Web in biology poses an interesting, not so well known, challenge to the Semantic Web community: that of knowledge representation within communities of practice. Representing and formalising knowledge for Semantic Web purposes has usually been studied within closed, complete contexts – Amazon (http://www.amazon.com), insurance companies, administrative environments – for which the business logic is not only known in advance, but for which the communities are also more prone to follow rules. The biological community is different, and these idiosyncratic factors must be taken into account. Moreover, it is not clear what constitutes knowledge in a broad sense for the biological community. One could say that a database entry may be considered data; however, as the database entry is annotated with meaningful information that places it within a valid context for the researcher, the boundaries between data, information and knowledge become difficult to see.

As we head towards a Semantic Web in bioinformatics it is important to have the community fully involved. Policies from those consortia gathered to promote the development of bio-ontologies should facilitate this engagement. These consortia, in close collaboration with computer and cognitive scientists, should ideally also address the technological component of such an engagement. A more insightful description of some of the situations this lack of understanding has generated has been presented in Chapter 7.
3 http://www.amazon.com
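The kind of ontology-mediated querying described at the start of this section can be made concrete with a minimal sketch, assuming the rdflib Python library; the source URLs and the vocabulary terms (ex:encodes, ex:participatesIn) are hypothetical stand-ins for RDF-published biological data sources, not references to any actual system.

```python
# A minimal sketch, assuming rdflib; all URLs and vocabulary terms are
# hypothetical stand-ins for RDF-published biological data sources.
from rdflib import Graph

g = Graph()
# Merging two sources into a single graph is the simplest form of an
# "ontological layer": both sources use shared URIs for genes, proteins
# and processes, so their statements compose without per-database glue.
g.parse("http://example.org/source-a/genes.rdf")     # hypothetical source
g.parse("http://example.org/source-b/pathways.rdf")  # hypothetical source

# A single query now spans both sources; interoperability arises from the
# shared semantics rather than from hand-written adapters for each bank.
results = g.query("""
    PREFIX ex: <http://example.org/vocab#>
    SELECT ?gene ?process WHERE {
        ?gene    ex:encodes        ?protein .
        ?protein ex:participatesIn ?process .
    }
""")
for gene, process in results:
    print(gene, process)
```

The point of the sketch is the pattern rather than the library: once entries are tethered to shared identifiers and relations, the join across databanks happens in the query itself, not in bespoke integration code.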
Simple guidance and criteria, such as how best to define the class structure within the bio-domain, may make a huge difference. Names should not matter much, as they will proliferate; ontology classes, on the other hand, should offer a more enduring structure. Ideally the class structure should follow axes based on time and space, continuants and occurrents. There is thus a need to disentangle meanings from names; in this way we may achieve a modular, accurate description of the world, based on facts and evidence rather than perceptions. Quoting Barry Smith4: "As Leibniz pointed out several centuries back in his criticism of John Locke's statement (roughly summarised): Since names are arbitrary and our understanding of the world is based on the names we give to things, our understanding of the world is arbitrary. Leibniz agreed names are arbitrary but our description of the real world is based on our best effort to describe facts as we see them - not on names. It's these aspects of Leibniz epistemology that have been used to great effect by the evo-devo researchers who have developed the concept of modularity/complementarity when describing the constraints on evolution - to wit - the possibilities the search space in which evolution functions - are not limitless, but are in fact constrained by the limits of POSSIBLE interactions amongst the many constituent entities."

Evidence is difficult to gather and represent when developing bio-ontologies. Evidence is often part of the discussion amongst domain experts; it is available as unstructured text, not always related to a particular class or property. This makes it hard for knowledge engineers to manage evidence properly. Ideally, an Integrated Community Development Environment for Ontologies (ICDEO) should also offer support for maintaining the evolution of ontologies, and part of that task is to preserve evidence in a reusable way. Some of these issues are addressed in Chapters 2, 3, 4 and 5.

As the need for integral descriptions of research endeavours grows, so does the effort required to cope with such a task. FuGO [9] has started to address issues related to modularity and ontology integration. Different groups, such as FuGO within the functional genomics context and the Generation Challenge Program (GCP) [10] within the plant world, are currently evaluating some existing standards for ontologies and metadata.
4 FuGO mailing list, http://fugo.sourceforge.net/lists/list.php
Dublin Core [11], SKOS [12], ISO/IEC [13] and others have been part of this assessment. Some guidance should then become available to bio-communities, so that there is a unified criterion for defining classes, properties and metadata in general. This is certainly a step in the right direction, but it is still too soon to predict its outcomes.
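The criterion argued for above, a class structure organised along continuant and occurrent axes, can be illustrated with a small sketch, again assuming rdflib; the class names below are illustrative placeholders, not terms taken from GO, MO, FuGO or any other ontology discussed in this thesis.

```python
# A minimal sketch, assuming rdflib; class names are illustrative only.
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS, OWL

EX = Namespace("http://example.org/bio#")  # hypothetical namespace
g = Graph()
g.bind("ex", EX)

# Two top-level axes: continuants (entities that endure through time)
# and occurrents (processes that unfold in time).
for cls in (EX.Continuant, EX.Occurrent):
    g.add((cls, RDF.type, OWL.Class))

# Names may proliferate and change, but the placement of each class on
# one of the two axes is meant to endure.
g.add((EX.Cell, RDFS.subClassOf, EX.Continuant))
g.add((EX.Protein, RDFS.subClassOf, EX.Continuant))
g.add((EX.CellDivision, RDFS.subClassOf, EX.Occurrent))
g.add((EX.Transcription, RDFS.subClassOf, EX.Occurrent))

print(g.serialize(format="turtle"))
```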
8.4 DEVELOPING BIO-ONTOLOGIES AS A COMMUNITY EFFORT
This thesis has demonstrated the role of communities of domain experts in developing ontologies; as ontologies imply the contribution and agreement of a community, they may be understood as "social agreements" for describing the reality of a particular domain of knowledge. The whole process resembles in many ways an exercise in participatory design and, even more interestingly, it follows the main precept of user-centred design: that designs should always focus on the users' perceptions. Chapters 2 and 4 present not only methodological aspects of building community ontologies, but also important details of the processes in which these methods were applied.

Within the context of designing technology for biological researchers, what is the role of the domain expert? An interesting parallel may be drawn from the field of designing children's technology, in which three main methodologies have been applied: User-Centred Design (UCD) [14], Participatory Design (PD) [15], and Informant Design (ID) [16]. They all focus on describing the kind of relationship between children and designers, which affects the input obtained. Interestingly, the relationships described by these authors, as well as the dynamics that emerge from the relationship between the children and the designer, proved to be applicable when designing technology for the biological community.

The UCD approach involves children in the design process as testers. This is the traditional role of children as end-users of technology, in which they are placed in a reacting role in order to give feedback to designers about the product [17]. In this approach the designers define what is suitable for children; they reach an advanced point in the design process before getting input from the users.
The fundamental assumption in the PD approach is that users and designers can view each other as equals; both therefore take on active roles in the design [17]. Following the same line of thinking, Druin and Solomon [18] have proposed having children as part of the design team, suggesting metaphors for the designers and sharing, to some extent, responsibilities and decision making. The ID perspective, on the other hand, considers children's input to play a fundamental role in the design process, thus seeing children as more than mere testers of technology. The participation of children in this process is defined according to the different phases of design and their goals. This approach sits somewhere between UCD and PD: children are informants, but cannot be considered co-designers [17].

When developing technology within the biological domain, the predominant approach has been to use the domain expert as an informant on requirements, as well as a tester of the end product. Research to determine what the role of the domain expert should be when developing his or her technology is therefore sorely needed, as the current approach has proven not very successful. From our experiences, as reported in Chapters 2 and 4, the constant input and interaction of the domain experts is crucial for the success of the information system. Domain experts should be involved throughout the entire process, not only when developing ontologies within biological communities but also during software development. Participatory design is thus the most suitable methodology, as control is shared by all of the design team members and their research agendas are open to changes and redefinitions. The position of the designers is that of someone who is interested in knowing about domain experts, someone who is willing to reshape his or her own ideas. This perspective supports a closer relationship, in which everyone is learning.

Designers working within an ID approach assume a position mediated by the goals of the different stages of the process. The research agenda is defined according to the informants' input across the process; hence, in those stages where domain experts take part as informants, the relationship resembles that promoted by PD: designers want to know facts they do not know about domain experts.
The "control" should therefore be shared whenever possible; domain experts should "lead" the whole process by establishing an egalitarian relationship.
8.5 REFERENCES

1. Jaiswal P, Avraham S, Ilic K et al: Plant Ontology (PO): a controlled vocabulary of plant structures and growth stages. Comparative and Functional Genomics 2005, 6(7-8):388-397.
2. Jaiswal P, Ware D, Ni J, Chang K, Zhao W, Schmidt S, Pan X, Clark K, Teytelman L, Cartinhour S, Stein L et al: Gramene: development and integration of trait and gene ontologies for rice. Comparative and Functional Genomics 2002, 3(2):132-136.
3. Ashburner M, Ball C, Blake J, Botstein D, Butler H, Cherry J, Davis A, Dolinski K, Dwight S, Eppig J et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genetics 2000, 25(1):25-29.
4. Stoeckert CJ, Parkinson H: The MGED ontology: a framework for describing functional genomics experiments. Comparative and Functional Genomics 2003, 4:127-132.
5. Spellman PT, Miller M, Stewart J, Troup C, Sarkans U, Chervitz S, Bernhart D, Sherlock G, Ball C, Lepage M, Swiatek M et al: Design and implementation of microarray gene expression markup language (MAGE-ML). Genome Biology 2002.
6. Pirolli P, Card SK: Report on Information Foraging. Palo Alto: Palo Alto Research Center; 2006.
7. Stevens R, Baker P, Bechhofer S, Ng G, Jacoby A, Paton NW, Goble CA, Brass A: TAMBIS: transparent access to multiple bioinformatics information sources. Bioinformatics 2000, 16(2):184-185.
8. Berners-Lee T: Weaving the Web. HarperCollins; 1999.
9. Functional Genomics Investigation Ontology [http://fugo.sourceforge.net/]
10. The Generation Challenge Program [http://generationcp.org]
11. The Dublin Core [http://dublincore.org/]
12. WD-swbp-skos-core-guide [http://www.w3.org/TR/2005/WD-swbp-skos-core-guide-20050510/]
13. ISO/IEC Metadata Standards [http://metadata-standards.org/]
14. Norman D, Draper S: User Centered System Design: New Perspectives on Human-Computer Interaction. New Jersey: Lawrence Erlbaum Associates; 1986.
15. Schuler D, Namioka A: Participatory Design: Principles and Practices. New Jersey: Lawrence Erlbaum Associates; 1993.
16. Scaife M, Rogers Y, Aldrich F, Davies M: Designing for or designing with? Informant design for interactive learning environments. In: Conference on Human Factors in Computing Systems: 1997; Atlanta, Georgia, USA: ACM; 1997.
17. Scaife M, Rogers Y: Kids as informants: telling us what we didn't know or confirming what we knew already. The Design of Children's Technology 1999:28-49.
18. Druin A, Solomon C: Designing Multimedia Environments for Children. New York: John Wiley; 1996.
9 Future work

9.1 BIO-ONTOLOGIES: THE MONTAGUES AND THE CAPULETS, ACT TWO, SCENE TWO: FROM VERONA TO MACONDO VIA LA MANCHA
9.1.1 Introduction

As the need for integrated biological research grows, ontologies become more and more important within the life sciences. The biological community needs not only controlled vocabularies but also guidance systems for annotating experiments, better and more reliable literature-mining tools, and most of all a consistent, shared understanding of what the information means. Ontologies should thus be understood as a starting point, not as an end in themselves. Although several efforts to provide biological communities with these required ontologies are currently in progress, some of them have so far proven too slow, too expensive, and too error-prone to meet the demand. The difficulties in these developments are due not only to the ambiguity of natural language and the fact that biology is a highly fragmented domain of knowledge, but also to the lack of consistent methodologies for ontology building in loosely centralised environments such as the biological domain [1, 2]. Biologists need methodologies and tools in the same way that computer scientists need real-life problems to work on; collaboration would thus be the easiest way to move forward. However, such interaction has proven difficult, as the two "houses" of Biology and Computer Science continue to fight each other in the field of windmills where Don Quixote is pointing a way towards the horizon.

Our three houses have been accurately described by Goble and Wroe [3]. Firstly "The Montagues": "one, comforted by its logic's rigour/Claims ontology for the realm of pure". Goble and Wroe define this house as the one of computer science, knowledge management, and Artificial Intelligence (AI). This community essentially works with well-scoped, well-behaved problems; its members work with generalisations and expect broadly applicable results. Their interest in ontologies lies in testing the boundaries of knowledge representation, the expressiveness of languages, and the suitability of reasoning engines.
For this community, building ontologies is a matter of research, led by research interests. Community involvement in these developments is minimal, domain experts are mainly informants, and the role of the knowledge engineer is predominant. Their ontologies are mostly deployed on a once-off basis.

Our second house, "The Capulets": "The other, with blessed scientist's vigour/acts hastily on models that endure." As Goble and Wroe define it, this is the house of the Life Sciences. Within this community the purpose of bioinformatics is to support research endeavours. It is a community with a pragmatic, needs-driven vision of computer science and a strong application pull. Ontologies for the Capulets are basically controlled vocabularies, taxonomies that allow them to classify things, very much in accordance with a very old tradition in this domain, one that started with the likes of Aristotle and Linnaeus. Within this house, the role of the knowledge engineer is that of someone who promotes collaboration in a loosely centralised environment. Biologists are thus not only leading the process but also designing the ontology and the software that will ultimately use it. Their ontologies are living entities, constantly evolving.

Following Goble and Wroe's analogy (henceforth Act 1, Prologue) we also have a third house: the Philosophers. For narrative purposes it has been named here the house of Don Quixote. For this house the essence of "things" is important, as its members seek a single model of truth itself. Some tangible contributions of this house are studies of the part/whole relationship, of how to model time, and of criteria for distinguishing among mutations, transformations, perdurance and endurance. Thanks to its heavy emphasis on theory, this house has provided us with a conceptual corpus for understanding ontologies.

The same story will be used as a baseline for the remainder of this chapter. Although the houses endure, we may be shifting acts and scenarios: as the Montagues and Capulets dig deeper into their disagreement, are we moving from Verona, via La Mancha, to Macondo, where we all may face a hundred years of solitude? This literary analogy thus introduces a possible ending point, Macondo, where we all may find the land of endogenous agreements.
A brief history of this drama is presented in Section 9.1.2; Section 9.1.3 presents some of the duels between the two main houses of our narrative. The disjunction, marriage or poison, is discussed in Section 9.1.4, where we argue the potential danger of heading towards Macondo versus remaining in Verona and finally living happily ever after.

9.1.2 Some background information

Act 1, Scene 1: "Verona. A public place." This is indeed true for both of our dignified households: computer scientists and biologists actively promote open-source initiatives. The Capulets have a long-standing tradition in which sharing code is an everyday activity; the OpenBio initiatives are a clear example of this, although they are not resources for workbench biologists but are meant to support bio-programmers and bioinformaticians. The Montagues also have an interesting record of collaborative efforts, the development of the Linux kernel and of KDE (the K Desktop Environment) to name two. Sharing code is, however, different from sharing knowledge, and unfortunately little attention has been paid to exactly "how" these communities have carried out knowledge management in their corresponding projects [4].

Act 1, Scene 2: "Halls and rooms in our households' houses." Over the last several years the Capulets have been developing different ontologies, among them the Gene Ontology (GO) and the Microarray Gene Expression Data (MGED) Ontology (henceforth MO) [5]; a comprehensive list is provided by OBO [6]. By the same token, the Montagues have several ontological initiatives, such as OpenCyc [7], a general knowledge base, and SUO, the Standard Upper Ontology [8]. The SUO working group is developing a standard that aims to specify an upper ontology to support computer applications such as data interoperability, information search and retrieval, automated inference, and natural language processing.

Act 1, Scene 3: "A lane by the wall of our household's orchard." While the Capulets focus on standardising the words and their meanings, our Montagues embrace relationships, logical descriptions and agent technology, and give serious consideration to some of the insights from the house of Don Quixote, such as Time, Matter, Substance, Mutability and many other essential properties of a concept.
The Montagues and the Quixotes tend to see the Capulets' ontologies as dictionaries, and are prone to point out their deficiencies with accuracy [9, 10]. The Capulets defend their efforts with passion; a valid point in their favour is the lack of knowledge that the Montagues and Quixotes have about the biological community.

9.1.3 The duels and the duets

Act 2, Scene 2: "What's in a name? That which we call a rose by any other name would smell as sweet." It was over two years ago that Hunter [11] responded to Brenner's [12] comment in Genome Biology. The interesting issue in those discussions was that the fundamental question both were trying to address was never stated explicitly: what is the role of ontologies in the life sciences? Since ontologies may also be understood as social agreements, the way Hunter responds to Brenner, arguing that ontologies are for programs and not for people, is completely descriptive of their purpose. It is also true that Brenner misses the point by portraying ontologies solely as taxonomies of words. Conceptual objects have concrete representations; they are in a way tangible objects, and should therefore enable computational tasks.

Act 2, Scene 2, part 2: "A fertile and dangerous playground." Recently Soldatova and King [10] published a series of shortcomings related to MO. It did not take long for Stoeckert et al. to respond [13]. One interesting issue in this scene is that neither party actually addressed a key point: such an ontology should provide the conceptual scaffold for describing microarray experiments. For instance, it should provide not just minimal descriptors but also logical constraints, so that inference becomes possible. Ontologies are not just controlled vocabularies; they should also provide support for reasoning processes. In order to describe a biological investigation it is necessary to use many different ontologies; how can we integrate these orthogonal ontologies so that the final narrative makes sense?
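As a minimal sketch of the kind of logical constraint at issue, and assuming rdflib, one can state in OWL that every microarray experiment must have at least one described sample; all class and property names here are hypothetical, not drawn from MO or any published ontology.

```python
# A minimal sketch, assuming rdflib; all names are hypothetical, not MO terms.
from rdflib import Graph, Namespace, BNode
from rdflib.namespace import RDF, RDFS, OWL

EX = Namespace("http://example.org/mo-sketch#")
g = Graph()

# owl:Restriction: "has at least one value of DescribedSample for hasSample".
restriction = BNode()
g.add((restriction, RDF.type, OWL.Restriction))
g.add((restriction, OWL.onProperty, EX.hasSample))
g.add((restriction, OWL.someValuesFrom, EX.DescribedSample))

# The constraint turns MicroarrayExperiment from a vocabulary term into a
# checkable definition: a description-logic reasoner can now detect an
# experiment description that lacks a described sample.
g.add((EX.MicroarrayExperiment, RDFS.subClassOf, restriction))
```

A controlled vocabulary alone cannot express, let alone enforce, such a constraint; a description-logic reasoner of the period (Racer, FaCT++ or Pellet, for instance) could check it, and this is precisely what separates an ontology that supports reasoning from a dictionary of terms.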
So far ontologies have focused on describing things and processes. However, the relationship between the entity in question and the process by which it is studied has not yet been fully explored. A "thing" is immersed in a context within which it is informative. The fragmentation of processes and things should be consistent (i.e. whole/part relations should be well defined) so that agents are able to "cut through" the myriad of information annotated with ontologies. How shall we encompass those apparently unrelated descriptions? The other side of the coin deals with the context, the broader picture in which the "thing" is of interest. For instance, when studying a disease we need to gather information not only about the experimental processes, but also about the different responses of the system to the alterations we have caused. We need to describe not only the "thing" we are studying but also the context in which it is being studied.

A disease may be seen as an alteration of one or more metabolic pathways, with the subsequent molecular implications. It may be described as a series of objects with individual states, individual disease instances, and relationships between particular objects. Disease representation requires capturing individual object states as well as the relationships between different objects. For example, one can use the GO term GO:0005021, vascular endothelial growth factor receptor activity, as a partial descriptor of the gene FLT3 (fms-related tyrosine kinase 3); this also allows an ATP-binding activity to be implied for this gene, as is understood from [14]. This, however, says nothing about the circumstances of the gene or its protein product in a disease state, or in an individual disease instance. The same can be said for disease objects, which can also be effectively described by ontologies, but without provision for state or relationship. Is it possible with existing ontologies to accurately describe a disease from both phenotypic and genotypic perspectives? Since ontologies offer what Brenner defines as dictionaries, such a representation is not yet possible. To some extent, the solution seems to be a linguistic exercise: utilising curated data sources and the biomedical literature first to define the relevant objects as they are both officially and commonly expressed, and then to define and determine the syntactic and semantic relationships between those objects. To us, existing and emerging ontologies play a key role in tethering the objects to an objective structure. However, it is the object states and relationships that truly represent disease states and instances. In our view, the dynamic nature of individualised disease states requires a more flexible conceptual model, one which encompasses the bridging of separate ontologies through relationships.
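The FLT3 example can be sketched in the same spirit, again assuming rdflib; the bridging relations (annotatedWith, alteredIn, hasState), the disease class, the state string and the GO URI form are all illustrative assumptions, not terms from any published ontology.

```python
# A minimal sketch, assuming rdflib; the bridging relations, disease class,
# state string and GO URI form are illustrative assumptions only.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF

EX = Namespace("http://example.org/bridge#")
GO = Namespace("http://purl.org/obo/owl/GO#")  # assumed GO URI form

g = Graph()

# The gene is tethered to an objective structure via its GO annotation.
g.add((EX.FLT3, EX.annotatedWith, GO.GO_0005021))

# The individual disease instance, not the ontology class, carries state.
case = EX.disease_case_42                      # one individual case
g.add((case, RDF.type, EX.SomeDisease))        # hypothetical disease class
g.add((EX.FLT3, EX.alteredIn, case))
g.add((case, EX.hasState, Literal("pathway altered; FLT3 overexpressed")))
```

The ontologies supply the stable anchors (FLT3, GO:0005021); the instance-level triples supply the states and relationships that, as argued above, actually represent a disease instance.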
Act 2, Scene 3: "Lingua Franca: a sanctorum by the orchard. He discovered then that he could understand written English and that between parchments he had gone from the first page to the last of the six volumes of the encyclopedia as if it were a novel." As Brenner states explicitly in his comment, we need to be fluent in our own language; but what does it mean to be fluent in one's own language? If someone learned a few phrases so that they could read menus in restaurants and ask for directions on the street, would you consider them fluent in the language? Certainly not. That type of phrase-book knowledge is equivalent to the way most people use computers today. Is such knowledge useful? Yes, but it is not fluency. To be truly fluent in a foreign language, you must be able to articulate a complex idea or tell an engaging story; in other words, you must be able to "make things" with language. Analogously, being digitally fluent involves not only knowing how to use technological tools, but also knowing how to construct things of significance with those tools [15]. Learning how to use Protégé does not make you an ontologist; by the same token, knowing GO does not make you a biologist. Respect and understanding for others' motivations, contributions and needs are fundamental to a successful marriage.

9.1.4 Marriage, Poison, and Macondo

"The world was so recent that many things lacked names, and in order to indicate them it was necessary to point." Verona and Macondo are an apt metaphor for bio-ontologies today. On one hand, Verona represents the possible starting point from which we may all do business and thus engage in win-win situations, as described by Stein [16]. Alternatively, Macondo represents the undesired possible arrival point: a magical realism in which man's astonishment before the wonders of the real world is expressed in isolation. Since the real world encompasses different views, codes of practice, rules, values and areas of interest, we should focus on our common point, fostering interdisciplinary collaboration and communication and thus engaging in business. GONG (Gene Ontology Next Generation) [17] and FuGO (Functional Genomics Investigation Ontology) [18] may illustrate how to work together.
They may eventually teach us important lessons, not only from the ontological perspective but also from the community perspective; however, it is still too soon to evaluate those lessons fairly. Some practical realism would consequently come in quite handy if we are all to avoid a hundred years of solitude.

9.1.5 References

1. Pinto HS, Staab S, Tempich C: Diligent: towards a fine-grained methodology for distributed, loosely-controlled and evolving engineering of ontologies. In: European Conference on Artificial Intelligence: 2004; Valencia, Spain; 2004: 393-397.
2. Garcia Castro A, Rocca-Serra P, Stevens R, Taylor C, Nashar K, Ragan MA, Sansone S: The use of concept maps during knowledge elicitation in ontology development processes - the nutrigenomics use case. BMC Bioinformatics 2006, 7:267.
3. Goble C, Wroe C: The Montagues and the Capulets. Comparative and Functional Genomics 2004, 5:623-632.
4. Hemetsberger A, Reinhardt C: Sharing and creating knowledge in open-source communities: the case of KDE. In: Fifth European Conference on Organizational Knowledge, Learning and Capabilities: 2004; Innsbruck, Austria; 2004.
5. Stoeckert CJ, Parkinson H: The MGED ontology: a framework for describing functional genomics experiments. Comparative and Functional Genomics 2003, 4:127-132.
6. OBO [http://obo.sourceforge.net/]
7. OpenCyc [http://www.opencyc.org/]
8. IEEE: Standard Upper Ontology [http://suo.ieee.org/]; 2006.
9. Smith B, Williams J, Schulze-Kremer S: The ontology of the Gene Ontology. In: AMIA Symposium; 2003.
10. Soldatova LN, King RD: Are the current ontologies in biology good ontologies? Nature Biotechnology 2005, 23:1095-1098.
11. Hunter L: Ontologies for programs, not people. Genome Biology 2002, 3(6).
12. Brenner S: Life sentences: ontology recapitulates philology. Genome Biology 2002, 3(4).
13. Stoeckert CJ, Ball C, Brazma A, Brinkman R, Causton H, Fan L, Fostel J: Wrestling with SUMO and bio-ontologies. Nature Biotechnology 2006, 24:21-22.
14. fms-related tyrosine kinase 3 [http://www.ncbi.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=summary&list_uids=2322]
15. Resnick M: Rethinking learning in the digital age. In: The Global Information Technology Report: Readiness for the Networked World. Oxford University Press; 2002.
16. Stein L: Creating a bioinformatics nation. Nature 2002, 417:119-120.
17. Gene Ontology Next Generation [http://gong.man.ac.uk/]
18. Functional Genomics Investigation Ontology [http://fugo.sourceforge.net/]
APPENDIXES

GLOSSARY

Activity: A constituent task of a process.

Ad hoc developments: Solutions addressing a specific problem via specific software development.

Application ontologies: Specialisations of domain and task ontologies; they form a base for implementing applications with a concrete domain and scope.

Communities of practice: The basic building blocks of a social learning system, because they are the social 'containers' of the competences that make up such a system. Communities of practice define competence by combining three elements. First, members are bound together by their collectively developed understanding of what their community is about, and they hold each other accountable to this sense of joint enterprise; to be competent is to understand the enterprise well enough to be able to contribute to it. Second, members build their community through mutual engagement; they interact with one another, establishing norms and relationships of mutuality that reflect these interactions, and to be competent is to be able to engage with the community and be trusted as a partner in these interactions. Third, communities of practice have produced a shared repertoire of communal resources (language, routines, sensibilities, artefacts, tools, stories, styles, etc.); to be competent is to have access to this repertoire and be able to use it appropriately.

Competency questions: Understood here as those questions for which we want the ontology to be able to provide support for reasoning and inference processes.

Concept maps: A concept map is a diagram showing the relationships among concepts. Concepts are connected with labelled arrows, in a downward-branching hierarchical structure.
The relationship between concepts is articulated in linking phrases, e.g., "gives rise to", "results in", "is required by" or "contributes to".

Domain analysis: The process by which a domain of knowledge is analysed in order to find the common and variable components that best describe that domain.

Domain expert: A domain expert or subject matter expert (SME) is a person with special knowledge or skills in a particular area; domain experts are individuals who are both knowledgeable about and extremely experienced with an application domain.

Domain ontologies: Ontologies that describe domain-specific vocabulary.

GCG: Formally known as the GCG Wisconsin Package; it contains over 140 programs and utilities covering the cross-disciplinary needs of today's research environment.

KAON: An open-source ontology management infrastructure targeted at business applications. It includes a comprehensive tool suite allowing easy ontology creation and management, and provides a framework for building ontology-based applications.

Knowledge: A mix of framed experience, values, contextual information, expert insight and grounded intuition that provides an environment and framework for evaluating and incorporating new experiences and information. It originates and is applied in the minds of knowers. In organisations, it often becomes embedded not only in documents or repositories but also in organisational routines, processes, practices and norms.

Knowledge elicitation: The process of collecting, from a human source of knowledge, information that is relevant to that knowledge.

Life cycle: A structure imposed on the development of a software product.

MGED: Microarray and Gene Expression Data, an international organisation of biologists, computer scientists and data analysts that aims to facilitate the sharing of data generated using microarray and other functional genomics technologies, for a variety of applications including expression profiling.
Method: An orderly process or procedure used in the engineering of a product or the performing of a service.

Methodology: A comprehensive, integrated series of techniques or methods creating a general system theory of how a class of thought-intensive work ought to be performed.

MOBY: The MOBY system for interoperability between biological data hosts and analytical services.

Ontology: A not-necessarily complete, formal classification of types of information, structured by relationships defined by the vocabulary of the domain of knowledge and by the canonical formulations of its theories.

Platform: A general, non-purpose-specific solution. A platform for data integration offers a technological framework within which it is possible to develop point solutions; platforms usually provide non-proprietary languages, data models and data exchange/exporting systems, and are highly customisable.

Process: A function that must be performed in the software life cycle.

Protégé: An ontology editor.

Relevant scenarios: The scenarios in which, according to the domain experts, a term was going to be used.

Semantic Web (SW): An evolving extension of the World Wide Web in which web content can be expressed not only in natural language but also in a format that can be read and used by software agents, thus permitting them to find, share and integrate information more easily.

Task: The atomic unit of work that may be monitored, evaluated and/or measured; a well-defined work assignment for one or more project members. Related tasks are usually grouped to form activities.

Task ontologies: Ontologies that describe vocabulary related to tasks, processes or activities.
TAVERNA: The Taverna project aims to provide a language and software tools to facilitate the easy use of workflow and distributed compute technology within the science community.

Technique: A technical or managerial procedure used to achieve a given objective.

Terminology extraction: Terminology extraction (also term extraction or glossary extraction) is a subtask of information extraction; its goal is to automatically extract relevant terms from a given corpus.

Text mining: Sometimes referred to as text data mining; refers generally to the process of deriving high-quality information from text.

Text2Onto: Software that supports the extraction of terminology.

UNIX: A computer operating system.

Workflow: A reliably repeatable pattern of activity enabled by the systematic organisation of resources, defined roles, and mass, energy and information flows into a work process that can be documented and learned. Workflows are always designed to achieve a processing intent of some sort, such as physical transformation, service provision or information processing.

W2H: A free WWW interface to sequence analysis software tools such as the GCG Package (Genetics Computer Group) and EMBOSS (European Molecular Biology Open Software Suite), and to derived services (such as HUSAR, the Heidelberg Unix Sequence Analysis Resources).

W3H: A workflow system for W2H.
ACRONYMS

BLAST: Basic Local Alignment Search Tool.
BRENDA: Braunschweig Enzyme Database.
BSML: Bioinformatics Sequence Markup Language.
CODATA: Committee on Data for Science and Technology.
CPL: Combined Programming Language.
DTD: Document Type Definition.
EMBOSS: European Molecular Biology Open Software Suite.
GBIF: Global Biodiversity Information Facility.
GO: Gene Ontology.
G-PIPE: Graphical Pipe.
GUI: Graphical User Interface.
HTML: HyperText Markup Language.
HUSAR: Heidelberg Unix Sequence Analysis Resources.
ICIS: International Crop Information System.
IEEE: Institute of Electrical and Electronics Engineers.
Jemboss: Java EMBOSS.
KEGG: Kyoto Encyclopedia of Genes and Genomes.
MAGE: MicroArray and Gene Expression.
MGED: Microarray and Gene Expression Data.
MIAME: Minimum Information About a Microarray Experiment.
MO: Microarray Ontology.
OQL: Object Query Language.
PATH: Phylogenetic Analysis Task in HUSAR.
PISE: Pasteur Institute Software Environment.
PO: Plant Ontology.
PSI: Proteomics Standards Initiative.
RSBI: Reporting Structure for Biological Investigations.
SOAP: Simple Object Access Protocol.
SNP: Single Nucleotide Polymorphism.
SQL: Structured Query Language.
SRS: Sequence Retrieval System.
SW: Semantic Web.
TAMBIS: Transparent Access to Multiple Bioinformatics Information Sources.
XML: Extensible Markup Language.
XPath: XML Path Language.
XQL: XML Query Language.
APPENDIX 1 – RSBI ONTOLOGY

This version of the RSBI ontology represents high-level concepts usually found in the description of biological investigations. Protégé, version 3.1, was the ontology editor used during the development of this ontology.
Appendix 1 - Figure 1. Identified properties for the RSBI ontology.
Appendix 1 - Figure 2. RSBI ontology
Appendix 1 - Figure 3. A concept map for RSBI ontology.
APPENDIX 2 – EXTRACTED TERMINOLOGY

This list of terms was gathered using Text2Onto, part of the KAON ontology framework. In total, eight documents were scanned with this software.

[Table: candidate terms extracted from the eight documents, with columns for c-value, number of words, number of occurrences, term, and alternate term forms. Representative single-word terms include germplasm, cross, pedigree, database, genealogy, cultivar, selection, breeding and method; representative multi-word terms include germplasm table, breeding method, single plant selection, self fertilising species, single seed descent and argument type use description.]
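The first column of the table is the c-value, a termhood measure that favours longer candidate terms occurring independently of the larger terms that contain them. As a rough illustration, the following is a sketch of the standard Frantzi-Ananiadou c-value, not Text2Onto's actual implementation; the frequencies loosely echo the table but are used purely for illustration, and the single-word case is handled with an ad hoc log floor.

```python
# A minimal sketch of the c-value idea; not Text2Onto's implementation.
import math

# Toy candidate-term frequencies, for illustration only.
freq = {
    "germplasm": 576,
    "germplasm id": 37,
    "germplasm bank": 11,
}

def c_value(term):
    """C-value of a candidate term (Frantzi-Ananiadou style)."""
    n_words = len(term.split())
    # log2 of the term length, floored at 2 so that single-word terms
    # do not collapse to zero (an ad hoc choice for this sketch).
    weight = math.log2(max(n_words, 2))
    # Longer candidates that contain this term ("nesting").
    nests = [t for t in freq if t != term and term in t]
    if not nests:
        return weight * freq[term]
    # Discount the mean frequency contributed by the nesting terms.
    discount = sum(freq[t] for t in nests) / len(nests)
    return weight * (freq[term] - discount)

for term in freq:
    print(f"{term}: c-value = {c_value(term):.2f}")
```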
APPENDIX 3 – GMS BASELINE ONTOLOGY (VERSION 1)

This version of the GMS ontology corresponds to the work done by Patrick Ward and Mark Wilkinson. Domain experts from the International Centre for Tropical Agriculture (CIAT) worked with this version at a later stage. Protégé, version 3.1, was the ontology editor used during the development of this ontology.
Appendix 3 - Figure 1. A portion of the first version of the GMS ontology, Germplasm.
Appendix 3 - Figure 2. The Germplasm Method section of the first version of the GMS ontology.
Appendix 3 - Figure 3. The Germplasm Identifier section of the first version of the GMS ontology.
APPENDIX 4 – GMS BASELINE ONTOLOGY (VERSION 2)

This version of the GMS ontology mostly corresponds to the work done together with domain experts from the Australian Centre for Plant Functional Genomics and the International Centre for Tropical Agriculture (CIAT). Protégé, version 3.1, was the ontology editor used during the development of this ontology.
Appendix 4 - Figure 1. Identified properties for the GMS ontology.
Appendix 4 - Figure 2. Genetic Constitution, as understood by the GMS ontology.
Appendix 4 - Figure 3. Germplasm Breeding Stock, a portion of the second version of the GMS ontology.
Appendix 4 - Figure 4. Naming convention according to the second version of the GMS ontology.
Appendix 4 - Figure 5. Plant Breeding Method according to the second version of the GMS ontology.
Appendix 4 - Figure 6. PlantPropagationProcesses according to the second version of the GMS ontology.
Appendix 4 - Figure 7. Some of the parent classes in the GMS ontology.
APPENDIX 5 – PROTOCOL DEFINITION FILE GENERATED BY G-PIPE - - A workflow for a rodent phylogeny. The exon 28 of gene evWF (von Willebrand Factor) from different rodents are used for the analysis. - Exon 28 from vWF genes are aligned using clustalw - progressive multiple sequence alignment + + + + + + + + + + + + + + + + + + + + + + + + + + + + 266
<!-- The alignment result from the previous step is used to build two phylogenies using two different methods from the Phylip set of methods -->
<!-- Parsimony method -->
<!-- Distance method -->
[The only other fragment that survives is the attribute version="3.6a2", presumably the Phylip release specified by the protocol.]
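Since the element markup is lost, a minimal sketch of what a protocol definition of this kind might look like is given below, reconstructed only from the comments above. The element and attribute names (protocol, description, task, program, param, depends-on) and the input file name vwf_exon28.fasta are illustrative assumptions, not the actual G-PIPE schema.

<?xml version="1.0"?>
<!-- Hypothetical sketch of a G-PIPE-style protocol; element and attribute
     names are assumed for illustration and are not taken from G-PIPE. -->
<protocol name="rodent_phylogeny">
  <description>A workflow for a rodent phylogeny based on exon 28 of the vWF
  gene (von Willebrand Factor) from different rodents.</description>
  <!-- Step 1: progressive multiple sequence alignment with clustalw -->
  <task id="align" program="clustalw">
    <param name="infile" value="vwf_exon28.fasta"/>
  </task>
  <!-- Step 2a: phylogeny from the alignment using the Phylip parsimony
       method (dnapars) -->
  <task id="parsimony" program="dnapars" version="3.6a2" depends-on="align"/>
  <!-- Step 2b: phylogeny from the alignment using the Phylip distance
       approach: dnadist computes the distance matrix, from which a tree
       would then be built (e.g. with neighbor) -->
  <task id="distance" program="dnadist" version="3.6a2" depends-on="align"/>
</protocol>

Expressing the pipeline declaratively in this way is what makes it reusable: the same protocol file can be re-executed or exchanged between groups independently of the interface that generated it.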
INDEX

A
Acronyms 241
Activity 54, 237
Ad hoc developments 237
Application ontologies 35, 237

B
BLAST 162, 241
BRENDA 180, 192, 241
BSML 241

C
CODATA xviii, 237
Communities of practice 158, 241
Competency questions 86, 93, 97, 132, 237
Concept maps 237
Control xxviii, 60, 62
CPL 160, 172, 241

D
Documentation processes 60
Domain analysis 64, 90, 94, 101, 112, 238
Domain expert 50, 57, 64, 65, 87, 90, 93, 94, 97, 100, 110, 125, 133, 140, 226, 238, 259
Domain ontologies 35, 238
DTD 241

E
EMBOSS 175, 176, 178, 179, 191, 196, 199, 205, 211, 215, 240, 241
Evolution xi, 44, 45, 49, 50, 56, 57, 72, 74, 75, 87, 88, 116, 119, 127, 131, 142, 143, 151, 172, 189, 190, 235, 238

F
Feasibility study and milestones 63

G
GBIF 241
GCG 175, 176, 177, 178, 191, 199, 211, 215, 238, 240
GO xvi, 31, 32, 71, 170, 171, 190, 219, 231, 233, 234, 241
G-PIPE xxvi, xxvii, 177, 178, 194, 195, 197, 198, 204, 205, 206, 207, 213, 214, 241
GUI 158, 159, 163, 164, 175, 176, 177, 178, 179, 186, 193, 195, 197, 199, 204, 211, 214, 241

H
HTML 99, 176, 177, 178, 191, 206, 241
HUSAR 178, 240, 241, 242

I
ICIS xxvii, 130, 133, 136, 140, 241
IEEE 41, 52, 54, 58, 59, 72, 73, 74, 77, 78, 106, 188, 190, 235, 241
Inbound-interaction 62

J
Jemboss 176, 179, 241

K
KAON 135, 142, 144, 247
KEGG 180, 192, 241
Knowledge xviii, xxvii, xxviii, 33, 34, 36, 42, 43, 46, 47, 51, 52, 53, 63, 71, 76, 77, 78, 87, 106, 107, 108, 116, 119, 120, 121, 155, 158, 166, 186, 221, 224
Knowledge acquisition 63, 71, 77, 190
Knowledge elicitation 36, 42, 43, 46, 47, 63, 119, 238

L
Life Cycle 77, 238

M
MAGE xxix, 158, 188, 220, 227, 241
MAGPIE 158, 188
Mailing lists 119
Management processes 62
Method 76, 239, 260, 265
MGED xvi, xix, xx, xxviii, xxix, 77, 80, 92, 93, 94, 105, 108, 125, 129, 145, 146, 147, 148, 150, 158, 193, 220, 227, 231, 235, 238, 241
MIAME xxi, xxix, 91, 145, 146, 150, 158, 241
MO xvi, xix, 49, 61, 94, 125, 220, 231, 232, 241
MOBY 164, 166, 175, 239

O
On-the-Ontology comments 62
Ontology 239
OQL 160, 242
Outbound-interaction 63

P
PATH 178, 191, 201, 203, 242
PISE xxvii, 193, 195, 196, 204, 205, 206, 242
PO 32, 52, 218, 227, 242
PRECIS 158, 188
Process 41, 78, 176, 239
Protégé xxvi, xxvii, 81, 95, 96, 99, 100, 101, 104, 110, 111, 113, 114, 115, 116, 118, 120, 127, 132, 133, 142, 147, 234, 239, 243, 259, 262
PSI 147, 151, 158, 242

R
Relevant scenarios 239
RSBI xxi, xxv, xxvii, 80, 92, 93, 99, 105, 108, 129, 145, 146, 147, 148, 149, 150, 193, 242, 243, 244, 245, 263, 266

S
Scheduling 60, 61, 62
SOAP 160, 162, 164, 175, 185, 202, 242
SQL 164, 179, 242
SRS 158, 160, 163, 164, 165, 178, 179, 183, 186, 188, 242
SW 44, 49, 50, 57, 73, 75, 87, 88, 119, 222, 239, 242

T
TAMBIS 171, 172, 183, 190, 221, 227, 242
Task 35, 178, 239, 242
Task ontologies 35, 239
TAVERNA 197, 204, 211, 240
Technique 240
Terminology extraction 85, 86, 132, 240
Text mining 240
Text2ONTO 134, 240
The Bernaras methodology 36, 40
The DILIGENT methodology 36
The Enterprise Methodology 36
The METHONTOLOGY methodology 36, 41

U
UNIX 175, 240

W
W2H 176, 177, 178, 183, 191, 196, 215, 240
W3H 176, 177, 178, 179, 196, 240
wiki pages 61, 62
WIT 180, 192
Workflow 177, 196, 205, 240

X
XML 99, 154, 156, 162, 164, 173, 174, 176, 177, 178, 183, 185, 190, 195, 202, 204, 205, 206, 212, 214, 242
XPath 173, 242
XQL 173, 202, 242