PROTERAN: ANIMATED TERRAIN EVOLUTION FOR. VISUAL ANALYSIS ...... the form of 23 chromosomes, which are made up of entwined deoxyribonucleic acid. (DNA) [1, 2]. .... 1 http://www.dpz.gwdg.de/clineu/people/abumaria/bild.jpg. 5 ...
PROTERAN: ANIMATED TERRAIN EVOLUTION FOR VISUAL ANALYSIS OF PROTEIN FOLDING TRAJECTORY Kush Kapila
A Thesis in The Department of Computer Science
Presented in partial fulfillment of the requirements for the degree of Master of Computer Science at Concordia University Montreal, Quebec, Canada
August 2004
© Kush Kapila, 2004
CONCORDIA UNIVERSITY School of Graduate Studies
This is to certify that the thesis prepared By:
Kush Kapila
Entitled:
PROTERAN: Animated Terrain Evolution for Visual Analysis of Protein Folding Trajectory
and submitted in partial fulfillment of the requirements for the degree of Master of Computer Science complies with the regulations of the University and meets the accepted standards with respect to originality and quality.
Signed by the final examining committee: Chair Dr. David Ford Examiner Dr. Gregory Butler Examiner Dr. Volker Haarslev Supervisor Dr. Sudhir Mudur Approved by Chair of Department or Graduate Program Director 2004
Dr. Nabil Esmail, Dean Faculty of Engineering and Computer Science
ABSTRACT PROTERAN: Animated Terrain Evolution for Visual Analysis of Protein Folding Trajectory Kush Kapila
In the field of bio-informatics, the analysis of voluminous data is becoming increasingly crucial to understanding the underlying biology and answering important questions. Various clustering techniques such as Hierarchical, SOMs, K-means and PCA are being used to cluster gene expression data to find the functions of unknown genes. Even these sophisticated algorithms are futile if the results are not appropriately interpreted, thus visualization techniques play an important role in analyzing data. Similar to clustering of gene expression data is that of clustering the characteristics of protein folding trajectory and a good visualization tool can help visual analysis and can provide faster and deeper insights into the manner in which a protein folds. With protein characteristics data and specific visualization requirements provided by Dr. Laxmi Parida and Dr. Ruhong Zhou of the Computation Biology Group at the IBM T. J. Watson Research Center, a new 3D visualization technique was designed and developed. This customized technique helps identify the major states a protein folds into through the use of an animated terrain. This technique was implemented as part of the interactive visualization program PROTERAN and tested with the β-Hairpin clustered data provided.
iii
Acknowledgements I would like to express my sincerest appreciation to all those who helped in the completion of this thesis. I would especially like to thank my supervisor, Dr. S.P. Mudur for his knowledge, his encouragement and his guidance during my research. For this I am forever indebted to him. I would also like to thank Dr. Laxmi Parida and Dr. Ruhong Zhou of the Computational Biology Group at the IBM T. J. Watson Research Center for their patience and giving me the opportunity to work with them. I would finally like to thank my family (Mom, Dad, Aarti, Kamal, Jij, Bhabhi, Pooja and Neel) for their support. Last but not least I would like to thank my fiancée, Sandhya, without her this thesis would not have been possible.
iv
TABLE OF CONTENTS CHAPTER 1 –
Introduction .................................................................................... 1
1.1
Genes & Proteins ................................................................................................ 1
1.2
Microarrays (Gene-Chips/Protein Chips) ........................................................... 4
1.2.1
Gene-Chips ................................................................................................. 4
1.2.2
Protein Chips............................................................................................... 7
1.3
Clustering of data................................................................................................ 8
1.4
Visualization of Clusters..................................................................................... 8
1.5
Research Problem ............................................................................................... 9
1.6
Objectives of this Research............................................................................... 10
1.7
Overview of Thesis ........................................................................................... 10
CHAPTER 2 2.1
Clustering and Visualization Techniques ................................... 12
Expression Matrix............................................................................................. 12
2.1.1
Data Adjustments...................................................................................... 13
2.2
Similarity Metrics ............................................................................................. 13
2.3
Clustering Methods........................................................................................... 14
2.4
Hierarchical Clustering ..................................................................................... 14
2.4.1
Algorithm.................................................................................................. 15
2.4.2
Visualization ............................................................................................. 17
2.4.3
Advantages................................................................................................ 18
2.4.4
Disadvantages ........................................................................................... 19
2.5
Self Organizing Maps (SOMs) ......................................................................... 20
2.5.1
Algorithm.................................................................................................. 20
2.5.2
Visualization ............................................................................................. 22
2.5.3
Advantages................................................................................................ 25
2.5.4
Disadvantages ........................................................................................... 26 v
2.6
K-Means............................................................................................................ 26
2.6.1
Algorithm.................................................................................................. 27
2.6.2
Visualization ............................................................................................. 30
2.6.3
Advantages................................................................................................ 30
2.6.4
Disadvantages ........................................................................................... 31
2.7
Principal Component Analysis (PCA) .............................................................. 31
2.7.1
Algorithm.................................................................................................. 32
2.7.2
Visualization ............................................................................................. 34
2.7.3
Advantages................................................................................................ 34
2.7.4
Disadvantages ........................................................................................... 35
2.8
Biclustering ....................................................................................................... 35
2.8.1
Algorithm.................................................................................................. 36
2.8.2
Observations ............................................................................................. 38
2.9
Available Software............................................................................................ 38
CHAPTER 3 -
Protein Folding Trajectory Analysis .......................................... 40
3.1
Data ................................................................................................................... 41
3.2
Patterned Clusters ............................................................................................. 44
3.2.1
The Need for Visually Analyzing the Patterned Cluster Data.................. 46
3.2.2
Program Requirements.............................................................................. 47
3.3
3.2.2.1
Global View:..................................................................................... 47
3.2.2.2
Navigation and focus: ....................................................................... 47
3.2.2.3
Relative growth rates: ....................................................................... 48
3.2.2.4
Detailed Query:................................................................................. 48
The need for a Custom Visualization Tool ....................................................... 48
CHAPTER 4 4.1
PROTERAN .................................................................................. 50
Terrain Metaphor .............................................................................................. 50
vi
4.2
Mapping of Patterned Cluster Data into Terrain Geometry.............................. 51
4.2.1
Patterned Cluster Layout........................................................................... 54
4.2.2
Animated Terrain Evolution ..................................................................... 58
4.2.3
Interaction Facilities.................................................................................. 58
4.3
Programming Tools .......................................................................................... 59
4.4
PROTERAN ..................................................................................................... 60
4.4.1
Opening Screen......................................................................................... 60
4.4.1.1
‘Filter File’ Button ............................................................................ 60
4.4.1.2
‘Processed File’ Button..................................................................... 60
4.4.1.3
‘Quit’ button...................................................................................... 60
4.4.2
Main Screen .............................................................................................. 61
4.4.2.1
Fly Through and Landing on Mountain:........................................... 62
4.4.2.2
Data Caption: .................................................................................... 63
4.4.2.2.1
Pattern: ............................................................................................ 63
4.4.2.2.2
Hill Number: ................................................................................... 63
4.4.2.2.3
Column x:........................................................................................ 63
4.4.2.2.4
Height:............................................................................................. 63
4.4.2.2.5
Time: ............................................................................................... 64
4.4.2.2.6
FPS:................................................................................................. 64
4.4.2.3 4.4.2.3.1
Buttons: ............................................................................................. 64 Animation ....................................................................................... 64
4.4.2.3.1.1
Start:......................................................................................... 64
4.4.2.3.1.2
Stop: ......................................................................................... 65
4.4.2.3.2
Panning: .......................................................................................... 65
4.4.2.3.2.1
Left/Right:................................................................................ 65
4.4.2.3.2.2
Forward/Back:.......................................................................... 65
4.4.2.3.2.3
Stop Panning: ........................................................................... 65
4.4.2.3.3
Quick View: .................................................................................... 65
4.4.2.3.3.1
Top & Bottom Views:.............................................................. 65 vii
4.4.2.3.3.2 4.4.2.3.4
4.4.3 4.5
Pattern 2-7:............................................................................... 65
Miscellaneous: ................................................................................ 65
4.4.2.3.4.1
Save Image:.............................................................................. 65
4.4.2.3.4.2
Quit: ......................................................................................... 66
Other features:........................................................................................... 66
Testing and Release .......................................................................................... 67
CHAPTER 5 -
Conclusions and Extensions......................................................... 68
5.1
Initial Experience with PROTERAN................................................................ 68
5.2
Extensions ......................................................................................................... 69
References ........................................................................................................................ 71 Appendix .......................................................................................................................... 75 APPENDIX A............................................................................................................... 75 APPENDIX B ............................................................................................................... 77
viii
LIST OF FIGURES Figure 1-1:
DNA double helix structure ........................................................................ 2
Figure 1-2:
Nucleotide structure. ................................................................................... 2
Figure 1-3:
Microarray................................................................................................... 5
Figure 1-4:
High-speed robot used to imprint cDNA strands onto a slide .................... 6
Figure 1-5:
Process of creating a microarray................................................................. 7
Figure 2-1:
Steps 1-4 of Agglomerative Hierarchical Clustering................................ 16
Figure 2-2:
Dendrogram 1 ........................................................................................... 17
Figure 2-3:
Dendrogram 2 ........................................................................................... 17
Figure 2-4:
Dendrogram 3 ........................................................................................... 17
Figure 2-5:
Dendrogram and reordered expression matrix.......................................... 19
Figure 2-6:
Principle of SOMs..................................................................................... 22
Figure 2-7:
Hierarchical Clustering of SOM subsets................................................... 23
Figure 2-8:
Error bar representation of gene expression data...................................... 24
Figure 2-9:
U-matrix representation of the SOM ........................................................ 24
Figure 2-10:
Neurons in a SOM. ................................................................................... 25
Figure 2-11:
Example of K-means, k = 2 ...................................................................... 29
Figure 2-12:
The results of the k-means algorithm with different k-values. ................. 31
Figure 2-13:
Principal component analysis ................................................................... 32
Figure 2-14:
Visualization of PCA on gene expression. ............................................... 34
Figure 2-15:
Biclustering versus traditional clustering.................................................. 37
Figure 3-1:
A hypothetical state diagram of a folding protein. ................................... 41
Figure 3-2:
Patterned cluster........................................................................................ 44
Figure 4-1:
PROTERAN ............................................................................................. 51
Figure 4-2:
Layout of Pattern Types on the Ground Plane.......................................... 55
Figure 4-3:
Layout of Column Combinations.............................................................. 56
Figure 4-4:
Opening screen of PROTERAN ............................................................... 61
Figure 4-5:
Main Screen of PROTERAN.................................................................... 62
Figure 4-6:
Data Caption ............................................................................................. 63
Figure 4-7:
Buttons used in PROTERAN.................................................................... 64 ix
LIST OF TABLES Table 1-1:
The 20 amino acids and their codons.......................................................... 3
Table 2-1:
Expression Matrix..................................................................................... 12
Table 2-2:
Some Image analysis and Data Mining Software Packages ..................... 39
Table 3-1:
Sample data of β-hairpin's reaction coordinates over time. ...................... 43
Table 3-2:
Threshold added to each characteristic. .................................................... 43
Table 3-3:
Results after adding threshold................................................................... 44
Table 3-4:
Sample patterned cluster file..................................................................... 46
Table 4-1:
Patterns of type 2 ...................................................................................... 53
Table 4-2:
Number of permutations of each pattern type........................................... 54
Table 4-3:
Pattern 2 with combination 01 .................................................................. 57
Table 4-4:
List of the Pattern Type Colors................................................................. 66
x
CHAPTER 1 – Introduction With the recent fusion of biology and computer science known as the discipline of bioinformatics, the last few decades have shown an explosion in the availability of biological data for the scientific community. This can be seen with the recent completion of the Human Genome Project where almost 95% of the genome, consisting of 6 billion nucleotides, has been mapped giving biologists a master blueprint for all the cellular structures. The analysis of this data is crucial to understanding the underlying biology and for answering vital questions. Various techniques such as data mining and clustering are being used on large data to extract useful information. Even algorithms and analysis techniques would not be useful if the results are difficult to interpret or are not appropriately interpreted: thus visualization techniques play an important role in bridging this gap. Through the use of visualization techniques, scientists are able to understand their results better as well as get a different perspective of the data that would otherwise be lost in a very large text file.
1.1 Genes & Proteins The complete set of instructions for making an organism is called its genome and is found in every nucleus of a person’s trillions of cells [1]. This blueprint is organized in the form of 23 chromosomes, which are made up of entwined deoxyribonucleic acid (DNA) [1, 2]. DNA is a double helix structure (cf. Figures 1-1) of repeating nucleotide base pairs, with each nucleotide containing one phosphate group, one sugar and one base (guanine, thymine, cytosine or adenine) (cf. Figure 1-2). A sequence of three nucleotides
1
constitutes a codon, which is the protein coding vocabulary. The sequence of codons in a gene specifies the amino-acid sequence of the protein it encodes (cf. Table 1-1) [3].
Figure 1-1:
Figure 1-2:
DNA double helix structure
Nucleotide structure consists of one phosphate group, one sugar and one base (guanine, thymine, cytosine or adenine).
2
Table 1-1:
The 20 amino acids used in proteins and the codons that code for each of those amino acids.
Amino Acids
Codons
Amino Acids
Codons
Ala
GCU, GCC, GCA, GCG
Leu
UUA, UUG, CUU, CUC, CUA, CUG
Arg
CGU, CGC, CGA, CGG, AGA, AGG
Lys
AAA, AAG
Asn
AAU, AAC
Met
AUG
Asp
GAU, GAC
Phe
UUU, UUC
Cys
UGU, UGC
Pro
CCU, CCC, CCA, CCG
Gln
CAA, CAG
Ser
UCU, UCC, UCA, UCG, AGU,AGC
Glu
GAA, GAG
Thr
ACU, ACC, ACA, ACG
Gly
GGU, GGC, GGA, GGG
Trp
UGG
His
CAU, CAC
Tyr
UAU, UAC
Ile
AUU, AUC, AUA
Val
GUU, GUC, GUA, GUG
Start
AUG, GUG
Stop
UAG, UGA, UAA
- Ala, Arg, Asn etc. are shortened names of Amino acids. The full names can be found in any text book on biology [2]. - A, G, C, and U are shorthand for nucleotides Adenine, Guanine, Cytosine and Uracil found in Ribonucleic Acid (RNA). RNA is a form of genetic information, involved in the translation process from DNA into proteins. A combination of any of these three nucleotides makes a codon which specifies an amino acid [2]. Through the proteins they encode, genes govern the cells in which they reside. In multicellular organisms they control development of the individual from the fertilized egg and the day-to-day functions of the cells that make up tissues and organs. The instrumental roles of their protein products range from mechanical support of the cell structure to the transportation and manufacture of other molecules and to the regulation of other proteins' activities [3].
3
Although there are roughly 3 billion nucleotide base pairs only 10% of the genome is known to include the protein-coding sequences called genes [1]. Furthermore even though most of the Human Genome has been mapped and the location of these genes has been found what still remains to be discovered are the functions of the proteins they code for. Knowing the function of these genes and their proteins could lead to improved drugs and cures to many diseases.
1.2 Microarrays (Gene-Chips/Protein Chips) 1.2.1 Gene-Chips It is these thousands of genes and their corresponding proteins that work together in a complex fashion to perform the everyday functions of an organism. Traditionally, experiments on genes and proteins have been performed on a one-by-one basis. Thus the throughput was limited and a global view hard to achieve. This problem was solved with the advent of microarrays (cf. Figure 1-3) or better known by Affymetrix’s trademark name, Gene Chips.
Usually made on glass but sometimes on nylon substrates,
microarrays enable scientists to analyze genes on a genomic level [4].
4
Microarray1
Figure 1-3:
Using high-speed robots, like the one depicted in Figure 1-4, thousands of complimentary DNA (cDNA) strands can be printed onto a slide. cDNA is similar to DNA except it is single stranded. Each of these cDNA strands called spots represent a gene; an entire genome can be printed on the standard microarray size of 1x3 inches [5]. Cells are then allowed to grow in regular conditions called the control, and experimental conditions called the sample. The mRNA from each of these types of cells is extracted (cf. Figure 1-5 (A)). Each mRNA is then converted to fluorescently labeled cDNA by incorporating either Cye-3 or Cye-5-dUTP using a single round of reverse transcription (cf. Figure 1-5 (B, C)) [6, 7]. The control sample is labeled with Cye-3 dye and the sample with Cye-5 that are green and red colors respectively. Both types of cDNA are mixed together and then placed on the microarray to incubate with the spots on the chip (cf. Figure 1-5 (D)). Each cDNA has a corresponding spot on the chip to attach itself to. Four possibilities could occur with a spot (gene) when the microarray is allowed to incubate with the cDNA. 1. Only control cDNA attaches itself to the gene 1
http://www.dpz.gwdg.de/clineu/people/abumaria/bild.jpg
5
2. Only sample cDNA attaches itself to the gene 3. Both types of cDNA (Red, Green) attach themselves to the gene 4. Neither the control nor the sample cDNA attach to the gene Any unbound cDNA is then washed off and the microarray is placed in a black box. The microarray is then passed over with a green laser that creates an image of only the spots with green cDNA attached to it. This image is stored for later use. The same thing occurs with a red laser, this time only spots with red cDNA are stored. By means of software, both images are then superimposed on each other. Spots with more control cDNA will appear green while spots that contain more sample cDNA will appear red. Spots with an equal amount of both types of cDNA will appear yellow (green mixed with red).
Finally spots without any cDNA will appear black.
By
examining the fluorescence intensities we can compare the relative expression levels of thousands of genes in a control versus sample condition.
Figure 1-4:
2
High-speed robot used to imprint cDNA strands onto a slide2
http://www.research.vt.edu/resmag/resmag2001/s_maroof_lab.html
6
Figure 1-5:
Process of creating a microarray3
1.2.2 Protein Chips In addition to studying genes through microarray experiments, proteins can also be analyzed using protein chips [4]. With more than 3 million proteins versus some 40,000 genes there may be more of a need for protein chips. Protein chips fall into two categories: 3
http://www.fao.org/DOCREP/003/X6884E/x6884e00.jpg
7
1.
Capture chips - Count the number of proteins in a sample.
2.
Interaction chips - Look at protein-protein interactions.
Protein chips are a relatively new technology and their 3D nature along with their tendency to denature on contact with the chip makes their creation far more difficult then microarrays [8]. Gene-chips and protein chips are however just the first step to finding the functions of unknown genes and the proteins they produce.
1.3 Clustering of data A key step in the analysis of gene expression data is the identification of groups of genes that manifest similar expression patterns [9]. Complex algorithms are used to group together genes of unknown functions with those whose functions are known. This is done in an attempt to learn more about the gene whose functions are unknown. This process of organizing genes into biologically relevant groups is called clustering and there are three reasons for this [7]. 1.
Genes that share similar expression patterns are said to be coexpressed. Evidence supports that many functionally related genes are coexpressed.
2.
Coexpressed genes can reveal much about regulatory systems
3.
Gene expression varies greatly in different cell types and states
1.4 Visualization of Clusters While clustering algorithms have proven to be useful as a first pass in finding more information about genes and their proteins, without a proper visualization of the
8
clustered data, valuable information is lost. With thousands of clusters, scientists lose the overall picture of the data. In addition, the relationship between individual clusters is hard to determine, if presented to the scientist in textual formats. A good visual representation should allow scientists to keep a global view as well as observe the relationship between clusters. A visual representation can also give biologists a different perspective on the data.
1.5 Research Problem Understanding how a protein folds into a functional or structural configuration is arguably one of the most important and challenging problems in computational biology. The interest is not just finding the final fold but of finding the major states that a protein folds into. Because the computational power necessary to calculate a realistic all-atom simulation is far from feasible, scientists aim to simplify the folding process by identifying the major states. By calculating the characteristics of a protein over time while folding and then clustering these characteristics, it is possible to confirm existing states as well as identify new ones. Similar to microarray experiments a matrix of data is the result of calculating the characteristics of a protein over time. Also similar to microarray data a clustering technique needs to be applied in order to extract the useful information. In our research we have addressed the problem of visualization of the above types of data. The motivation for this research is part of our collaboration with two scientists of the Computational Biology Group at the IBM T. J. Watson Research Center in New
9
York. Dr. Laxmi Parida and Dr. Ruhong Zhou of this center are studying the effects of the β-Hairpin protein folding over time. Comparable to microarray data, clustering of the protein’s characteristics proved to be a useful first step. With the very large data set that resulted there was a need for an appropriate visualization tool.
We carried out a
comprehensive survey of scientific visualization tools available in the public domain. However a single tool that provides all the visualization facilities required for this purpose could not be found. Hence it was decided that we would develop our own customized visualization software.
1.6 Objectives of this Research The main objectives of the research are: 1.
Carry out a comprehensive study of visualization techniques and tools suitable for protein state data
2.
Develop a customized visualization software with the following capabilities •
Provide a visual global view of the data
•
Enable visual identification of major data clusters
•
Enable visualization of relative growth of data clusters over time
1.7 Overview of Thesis Chapter 2 provides a comprehensive survey of the current state of the art in clustering and visualization techniques used for gene expression data. In Chapter 3 we 10
describe some details of the protein folding experiments carried out by Dr. Parida and Dr. Zhou and their technique of using patterned clusters for gaining insight into the different states of a protein during the folding operation. In this chapter we also bring out the need for a customized visualization tool. In Chapter 4 we describe the new visualization techniques that we have created to visualize patterned clusters through the use of an animated terrain. Protein Terrain Analyzer or PROTERAN for short uses mountains that grow to reflect the evolution of major states over time. PROTERAN is a novel approach to visualizing patterned clusters and was successfully proven on the β-hairpin protein data provided by Dr. Parida and Dr. Zhou. Chapter 5 outlines the results of PROTERAN for Dr. Parida and Dr. Zhou as well as the future work to be done. Finally Appendix A gives a list of distance metrics used to compare genes for clustering while Appendix B provides the mathematics necessary to perform Principal Component Analysis, a common clustering technique.
11
CHAPTER 2 - Clustering and Visualization Techniques Clustering is the process of organizing objects into groups whose members are similar in some way [10]. In gene expression, genes of unknown function are clustered with genes of known function in an attempt to learn more about the former4. This then poses the problem of how to decide how similar two genes are. In order to determine the similarity of two genes, it is necessary to adopt a mathematical description of similarity [11].
Before a mathematical description can be formulated, the exact data that
characterizes a gene must be examined.
2.1 Expression Matrix The data obtained from a microarray consists of a matrix of fluorescence intensity values for N genes by M experiments. These values are obtained by taking the ratio of C5 (red dye) to C3 (green dye) (cf. Table 2-1).
Table 2-1:
Expression Matrix Experiment 1
Experiment 2
………..
Experiment M
Gene 1
C511/C311
C512/C312
………..
C51M/C31M
Gene 2
C521/C321
C522/C322
………..
C52M/C32M
. . . Gene N
. . . C5N1/C3N1
. . .
. . . C5N2/C3N2
4
………..
. . . C5NM/C3NM
Though we present clustering and visualization for gene expression data the same techniques are applicable to protein state characteristics.
12
Therefore a gene in which the sample is overexpressed (i.e. sample is greater than control) would have a fluorescence value greater than one (C5>C3). A gene that is underexpressed (C3>C5) would have a value of less than one and greater than zero. This however does not lend itself to a well dispersed range. Overexpressed genes range between one and infinity while underexpressed genes range only between zero and one. Therefore by taking the logarithmic of the ratio, overexpressed genes will map to positive values and underexpressed genes to negative values. Thus each value in the matrix can now be replaced with log2(C5ij/C3ij) where i (0