Ontological Self-Organizing Maps for Cluster Visualization and Functional Summarization of Gene Products using Gene Ontology Similarity Measures. Timothy C. Havens, James M. Keller, Mihail Popescu, and James C. Bezdek.
Abstract— This paper presents an ontological self-organizing map (OSOM), which is used to produce visualization and functional summarization information about gene products using Gene Ontology (GO) similarity measures. The OSOM is an extension of the self-organizing map as initially developed by Kohonen, which trains on data composed of sets of terms. Term-based similarity measures are used as a distance metric as well as in the update of the OSOM training procedure. We present an OSOM-based visualization method that shows the cluster tendency of the gene products. Also demonstrated is an OSOM-based functional summarization which produces the most representative term(s) (MRT) from the GO for each OSOM prototype and, subsequently, each gene product cluster. We validated the results of our method by applying the OSOM to a well-studied set of gene products.
I. INTRODUCTION
THE algorithms proposed in this paper are designed to work with data sets composed of collections of terms organized in a hierarchical taxonomy. We refer to these data sets as ontological data. These are typically composed of hundreds of dimensions (individual terms) and can also be very large, on the order of 10,000 samples. One way that researchers have dealt with conventional high-dimensional data sets is to employ self-organizing maps (SOM), as initially proposed by Kohonen [1], [2], [3]. The SOM allows these types of data to be effectively visualized in two or three dimensions by combining the goals of both projection and clustering algorithms [4]. We apply a novel extension to the SOM that allows us to use the SOM with ontological data. Ontological data is unique in that the data samples are composed of collections of terms or words, taken from a predefined corpus. Hence, the samples do not have a numerical location, unlike conventional object data. Examples of ontological data could include web sites, medical record annotations, and publications, among others.

We apply our ontological self-organizing map (OSOM) to produce cluster visualization and functional summarization of annotated gene products in the Gene Ontology (GO). The relational data of the gene products are produced by GO similarity measures as described in [5]; however, any quantitative similarity measure could be used, such as those described in [6], [7]. Figure 1 is a block diagram of the OSOM training algorithm.

Fig. 1. OSOM training block diagram.

First, we describe how we utilize the OSOM to produce cluster visualization of the gene products. The visualization method maps the “gene product profiles” (the OSOM prototypes) of the OSOM network to a two-dimensional toroidal grid. In order to show the cluster tendency of gene products, the relations between neighboring gene product profiles on the grid are displayed as gray levels, with black representing no relation and white representing highly related.

The functional summarization of each gene product profile (gene product cluster) is a direct result of our formulation of the OSOM. Each prototype of the OSOM network is represented by a vector of weights, where each element of the weight vector is associated with a term from the GO corpus. The value of the weight is the membership of the associated term in the description of the gene product profile. Thus, the functional summarization of each gene product profile is the GO term(s) with the largest corresponding weight vector element(s).

The illustrative results presented in this paper are based on a set of 194 human gene products. These gene products were retrieved on December 10, 2003 using the ENSEMBL browser [http://www.ensembl.org]. Table I shows the attributes of the families present in the data set, according to Markov clustering [9]. We have used this gene product data set in past publications and, for comparative purposes, use it to illustrate the results of our current method. We call this set $GPD194_{12.10.03}$.

TABLE I
CHARACTERISTICS OF THE $GPD194_{12.10.03}$ DATA SET EXTRACTED FROM ENSEMBL [8]
ENSEMBL family ID    $F_i$ = Protein family    $N_i$ = No. of sequences
339                  myotubularin              21
73                   receptor precursor        87
42                   collagen alpha chain      86

Timothy Havens is with the Department of Electrical and Computer Engineering, University of Missouri-Columbia, Columbia, MO 65211, USA (corresponding author: phone: 573-882-6387 fax: 573-882-0397 email: [email protected]).
James Keller is with the Department of Electrical and Computer Engineering, University of Missouri-Columbia, Columbia, MO 65211, USA (email: [email protected]).
Mihail Popescu is with the Health Management and Informatics Department, University of Missouri-Columbia, Columbia, MO 65211, USA (email: [email protected]).
James Bezdek is currently visiting the Department of Electrical and Computer Engineering, University of Missouri-Columbia, Columbia, MO 65211, USA (email: [email protected]).

A. Previous Work
There is a rich body of research on clustering and functional summarization of gene products. The survey [10] describes many computational methods for clustering gene products, including hierarchical clustering, mutual information, and self-organizing maps. Of note is the GENECLUSTER algorithm, which uses SOMs to cluster gene products based on their gene expression pattern [11]. Our method is similar in spirit to a recent effort by Brameier and Wiuf [12], in which the authors proposed a clustering and visualization method using both gene expression data and GO terms. However, Brameier and Wiuf mapped the gene product annotations to a reduced vocabulary of generalized GO terms. We establish a soft similarity measure-based method to utilize the actual GO annotations for each gene product, thus preserving the specificity of the GO terms in the training data. Also, we utilize only the GO to produce the cluster and visualization information. Hence, our method allows for knowledge discovery within the GO database itself or other term-based databases that are organized as directed acyclic graphs [13]. Pham et al. [14] describe a method that utilizes SOMs to cluster multilingual documents with a term-based vector-space model for the SOM prototypes. Our method uses a similar representation, but we extend the SOM update equation to leverage the pair-wise similarities between ontology terms.

In section II we briefly discuss the GO similarity measures used in this paper. Section III outlines in detail the OSOM algorithm as well as the visualization and functional summarization methods. Section IV concludes this paper.

II. GENE ONTOLOGY SIMILARITY MEASURES
Information about gene products and how they are similar to one another is of great importance in bioinformatics.
Traditional approaches use the DNA sequence as well as the expression values from microarray experiments. However, additional information is available about gene products, which is more symbolic in nature, including Gene Ontology (GO) terms [15] and index terms in publications about gene products [16]. We use these symbolic data about gene products to build visualization and functional summarization.

Previously, we developed methods of computing similarity measures for two gene products that are annotated by GO terms. These similarity measures are described in detail in [5], [17]. To summarize, each gene product, $G_i$, is represented by a collection of terms $G_i = \{T_{i1}, ..., T_{in}\}$. Based on these sets of terms, a similarity between two gene products can be found by performing an aggregation on the pairwise similarities among each set of terms. For example, the average is computed as:
$$S(G_1, G_2) = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} s_{ij}}{mn}, \quad (1)$$
where $G_1$ is annotated by $n$ terms, $G_2$ is annotated by $m$ terms, and $s_{ij}$ is the pair-wise similarity between terms. The term pair-wise similarity is computed as in [17], [18], [19], where the shortest paths between two terms on the hierarchical tree as well as information theoretic constructs are used.

For this paper, we combine all the terms from a set of gene products into one large set and compute a pair-wise similarity matrix $R$ with eq. (1). Note that any aggregation operator could be used to compute $R$. The $GPD194_{12.10.03}$ data set contains 64 total terms; thus, the similarity matrix $R$ is 64x64. The pre-computed similarity matrix allows us to quickly compute gene product similarities by casting many of the operations in the OSOM as matrix-vector multiplications.

For the rest of this paper, we denote the gene products as binary vectors, $\vec{g}_i \in \{0, 1\}^{N_T}$. This vector represents the $i$-th gene product and has length $N_T$, where $N_T$ is the total number of terms represented in the similarity matrix $R$. In this paper, $N_T = 64$. A vector element $g_{ik} = 1$ indicates that the gene product is annotated by the GO term indexed $k$. A value of 0 indicates that the term is not present in the gene product annotation.

III. ONTOLOGICAL SELF-ORGANIZING MAP
The self-organizing map is a two-layer lateral feedback neural network that topologically maps itself to the training data. The network structure is often set to a two-dimensional square, toroidal, or hexagonal grid, where each network node, or prototype, is laterally connected to its neighbors.
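As a small numerical illustration of the notation from Section II, which the OSOM reuses, the following sketch computes the average aggregation of eq. (1) in two ways: directly from the pair-wise term similarities, and as a matrix-vector product on binary annotation vectors. The 4x4 matrix `R`, the term sets `G1` and `G2`, and the helper names are hypothetical toy values chosen for illustration; it is also assumed here that the entries of `R` hold the term similarities $s_{ij}$.

```python
import numpy as np

def average_similarity(R, terms_a, terms_b):
    """Eq. (1): mean of all pair-wise term similarities between two term sets."""
    block = R[np.ix_(terms_a, terms_b)]  # n x m block of s_ij values
    return float(block.mean())

def encode(term_indices, n_terms):
    """Binary annotation vector g in {0,1}^{N_T} for a set of term indices."""
    g = np.zeros(n_terms)
    g[list(term_indices)] = 1.0
    return g

# Toy symmetric term-similarity matrix (hypothetical values, N_T = 4).
R = np.array([
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.6],
    [0.0, 0.1, 0.6, 1.0],
])
G1 = [0, 1]      # gene product annotated by terms 0 and 1 (n = 2)
G2 = [1, 2, 3]   # gene product annotated by terms 1, 2 and 3 (m = 3)

# Direct evaluation of eq. (1).
s_direct = average_similarity(R, G1, G2)

# The same quantity cast as a matrix-vector product on binary vectors.
g1, g2 = encode(G1, 4), encode(G2, 4)
s_matrix = float(g1 @ R @ g2 / (g1.sum() * g2.sum()))
```

Both routes give the same value, which is the reason a pre-computed `R` lets the later similarity computations be written as matrix-vector multiplications.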
The network learning algorithm is as follows:
1) Randomly draw a sample from the training data, $\vec{x}_d$.
2) Find the closest SOM prototype $p$ according to a chosen distance metric,
$$p = \arg\min_i \{\|\vec{x}_d - \vec{w}_i^{(old)}\|\}. \quad (2)$$
3) Update the SOM prototypes by
$$\vec{w}_i^{(new)} = \vec{w}_i^{(old)} + \epsilon(t) \cdot h_{ip} \cdot (\vec{x}_d - \vec{w}_i^{(old)}), \quad (3)$$
where $\epsilon(t)$ is the learning rate and $h_{ip}$ is the neighborhood function defined as
$$h_{ip} = \exp\left(-\frac{|\vec{a}_i - \vec{a}_p|^2}{\sigma^2(t)}\right), \quad (4)$$
where $\vec{a}_i$ is the location of the SOM prototype in the predefined neighborhood (i.e., square or hexagonal grid).
This algorithm is repeated until a maximum number of iterations is reached. Typically, the learning rate $\epsilon(t)$ and
the width of the neighborhood function $\sigma^2(t)$ are reduced during iteration, with the effect that late iterations only update network prototypes local to the winning prototype $p$.

The algorithm we propose as the OSOM is an adaptation of the standard SOM to ontological data. First, we construct an ontological weight vector for each node in the OSOM grid, which is a fuzzy membership representation of all the terms present in the training data. For example, in the $GPD194_{12.10.03}$ data set there is a total of 64 terms among all the gene products combined; thus, the OSOM weight vector has a length of 64. Each weight vector element is associated with one GO term, and the value of the weight vector element is the membership of the associated term in the description of the “gene product profile” that the OSOM prototype represents. We denote the OSOM weight vectors as $\vec{w}_i \in [0, 1]^D$, where $D = N_T = 64$, the total number of terms.

Second, we replace the distance metric in step 2 of the SOM with a similarity measure. The measures we use are vector-matrix multiplication-based operations that are simple extensions of the measures described in section II and [5]. The measures we present in this paper are as follows:
• Average (AVG):
$$S^{(AVG)}(\vec{w}_i, \vec{g}_j) = \frac{\vec{w}_i^T R \vec{g}_j}{N_T |\vec{g}_j|}, \quad (5)$$
where $R$ is the term-similarity matrix and $\vec{g}_j$ is the $j$-th training data vector as presented in section II.
• Normalized Average (NAVG):
$$S^{(NAVG)}(\vec{w}_i, \vec{g}_j) = \frac{S^{(AVG)}(\vec{w}_i, \vec{g}_j)}{\max\{r_A, r_j\}}, \quad (6)$$
where
$$r_A = \frac{1}{(N_T)^2} \sum_{k=1}^{N_T} \sum_{l=1}^{N_T} R_{kl}, \quad \text{and} \quad r_j = \frac{1}{|\vec{g}_j|^2} \sum_{k=1}^{N_T} (R\vec{g}_j)_k,$$
where $(R\vec{g}_j)_k$ is the $k$-th element of the matrix-vector multiplication.
• Linear Order Statistic (LOS): $\vec{l} = (R\vec{w}_i) .* \vec{g}_j$, where $.*$ represents element-by-element multiplication. The vector $\vec{l}$ is then sorted in descending order, $l_{(1)} \geq l_{(2)} \geq ... \geq l_{(N_T)}$, and the LOS similarity is computed by
$$S^{(LOS)}(\vec{w}_i, \vec{g}_j) = \frac{1}{N_T N_L} \sum_{k=1}^{N_L} l_{(k)}, \quad (7)$$
where $N_L$ is the number of elements of $\vec{l}$ that are averaged (we use $N_L = 4$ for $N_T = 64$). For example, $N_L = 1$ would result in the similarity being set to the maximum pair-wise similarity value between the individual elements of the OSOM weight vector $\vec{w}_i$ and the gene product term vector $\vec{g}_j$.

Finally, we replace the weight vector update equation in step 3 of the SOM with a similarity-based update. In order to create an update equation, we define two axioms that the equation must satisfy:
1) The weight vectors must move closer to the randomly chosen training data vector, $\vec{g}_d$, at each iteration.
2) The weight vectors must also move closer to terms that are similar to the terms in $\vec{g}_d$.
With these axioms in mind, we created the following update equation
$$\vec{w}_i^{(new)} = \vec{w}_i^{(old)} + \epsilon(t) \cdot h_{ip} \cdot \left(F(R, \vec{g}_d) - \vec{w}_i^{(old)}\right), \quad \forall i, \quad (8)$$
where $p$ denotes the closest OSOM prototype to the randomly chosen training vector $\vec{g}_d$ and where $(F(R, \vec{g}_d) - \vec{w}_i^{(old)})$ is the update operator. As shown below, the update operator is computed from the columns of the similarity matrix that correspond to non-zero elements of the training vector $\vec{g}_d$. These columns of the similarity matrix represent the similarity between the non-zero terms in $\vec{g}_d$ and all other terms (i.e., $R_{ii} = 1$ because a term is perfectly similar to itself). Hence, the update operator, $F(R, \vec{g}_d)$, computes a row aggregation on the similarity matrix $R$ and the training vector $\vec{g}_d$, producing the update step for the OSOM prototypes. The operator $F$ can be modeled after any aggregation operator [20]; for simplicity, we define $F$ as one of the following:
• Average (AVG):
$$F^{(AVG)}(R, \vec{g}_d) = \frac{R\vec{g}_d}{|\vec{g}_d|}. \quad (9)$$
• Maximum (MAX):
$$F_k^{(MAX)}(R, \vec{g}_d) = \max_i \{R_{ki}\}, \quad (10)$$
where $i = \{l \in \mathbb{N} \,|\, l \leq N_T; (g_d)_l = 1\}$, $k = 1, ..., N_T$, and $R_{ki}$ is the $i$-th column of the $k$-th row of the similarity matrix $R$.

The operator chosen for $F$ determines the convergence behavior of the values of the prototype weight vectors, $\{\vec{w}_i\}$. For example, $F^{(AVG)}$ causes the weight vectors to have maximum values around 0.5, as the operator averages the similarity values for all terms in the training vectors. In contrast, $F^{(MAX)}$ causes the maximum weight vector values to be close to 1, as there are exactly $|\vec{g}_d|$ terms equal to one in the matrix-vector multiplication $R\vec{g}_d$ (each term in $\vec{g}_d$ has a similarity of 1 to itself). Simply put, $F^{(MAX)}$ pushes the OSOM prototypes towards the terms present in $\vec{g}_d$ and, additionally, pushes the prototypes towards all the terms represented in $R$ that are similar to any one of the terms in $\vec{g}_d$.

Algorithm 1 outlines the complete OSOM algorithm. The parameters, such as the learning rates and maximum iterations, are set according to the problem. For this paper, we use a toroidal grid-based network with 400 neurons (20x20).
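The similarity measures and update operators above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: $|\vec{g}|$ is taken to be the number of annotated terms (the sum of the binary vector), the function names and the 4-term toy matrix are hypothetical, and the NAVG measure of eq. (6) is left out for brevity.

```python
import numpy as np

def s_avg(w, g, R):
    """Eq. (5): w^T R g / (N_T |g|), with |g| taken as the number of terms in g."""
    return float(w @ R @ g) / (len(g) * g.sum())

def s_los(w, g, R, n_l=4):
    """Eq. (7): sum of the n_l largest entries of (R w) .* g, over N_T * n_l."""
    l = np.sort((R @ w) * g)[::-1]          # descending order statistics
    return float(l[:n_l].sum()) / (len(g) * n_l)

def f_avg(R, g):
    """Eq. (9): average of the columns of R selected by g."""
    return (R @ g) / g.sum()

def f_max(R, g):
    """Eq. (10): per row k, the largest R[k, i] over the terms i present in g."""
    return R[:, g.astype(bool)].max(axis=1)

def update(w, target, eps, h):
    """Eq. (8) for one prototype: move w towards the aggregated target F(R, g)."""
    return w + eps * h * (target - w)

# Toy data (hypothetical values): 4 terms, one gene product with terms 0 and 1.
R = np.array([
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.6],
    [0.0, 0.1, 0.6, 1.0],
])
g = np.array([1.0, 1.0, 0.0, 0.0])
w = np.array([0.5, 0.5, 0.0, 0.0])

w_new = update(w, f_max(R, g), eps=0.5, h=1.0)
```

In this toy case `f_max` returns exactly 1 for the two annotated terms and smaller values for the related terms, while `f_avg` returns 0.9 for the annotated terms, which mirrors the qualitative difference between the two operators described in the text.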
Fig. 2 illustrates the shape of the network topology. The learning rates are $\{\epsilon_0 = 0.5, \epsilon_f = 0.005\}$, the widths of the lateral influence function in eq. (4) are $\{\sigma_0 = 3.0, \sigma_f = 0.1\}$, and the maximum number of iterations is $t_{max} = 10,000$.

Algorithm 1: Ontological Self-Organizing Map.
Data: $\vec{g}_i$, $i = 1, ..., N_G$, where $\vec{g}_i$ is the $i$-th vector of the training data.
Randomly initialize OSOM prototype weight vectors, $\vec{w}_i$, $i = 1, ..., N_r$, on the interval [0, 1].
$t \leftarrow 0$
while $t < t_{max}$ do
  1) Randomly draw a single training data vector, $\vec{g}_d$.
  2) Find the closest (most similar) prototype, $p = \arg\max_i S(\vec{w}_i, \vec{g}_d)$.
  3) Update the prototype weight vectors with eq. (8).
  4) $\sigma(t) = \sigma_0 (\sigma_f / \sigma_0)^{t/t_{max}}$.
  5) $\epsilon(t) = \epsilon_0 (\epsilon_f / \epsilon_0)^{t/t_{max}}$.
  6) $t \leftarrow t + 1$
end
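Algorithm 1 can be sketched as a short training loop. The sketch below is not the authors' implementation: it assumes the AVG similarity of eq. (5) for winner selection, the MAX operator of eq. (10) for the update, a winner chosen as the most similar prototype, and grid distances measured with wrap-around on the torus. The function name `train_osom`, its argument names, and the toy data are hypothetical.

```python
import numpy as np

def train_osom(G, R, rows=20, cols=20, t_max=10_000,
               eps0=0.5, eps_f=0.005, sigma0=3.0, sigma_f=0.1, seed=0):
    """Train OSOM prototypes W (rows*cols x N_T) on binary gene product vectors G."""
    rng = np.random.default_rng(seed)
    n_terms = R.shape[0]
    W = rng.random((rows * cols, n_terms))            # random init on [0, 1)
    # Grid location a_i of every prototype, in row-major order.
    coords = np.stack(np.unravel_index(np.arange(rows * cols), (rows, cols)), axis=1)
    size = np.array([rows, cols])
    for t in range(t_max):
        frac = t / t_max
        sigma = sigma0 * (sigma_f / sigma0) ** frac   # step 4: neighborhood width
        eps = eps0 * (eps_f / eps0) ** frac           # step 5: learning rate
        g = G[rng.integers(len(G))]                   # step 1: random training vector
        sims = (W @ R @ g) / (n_terms * g.sum())      # eq. (5) for all prototypes
        p = int(np.argmax(sims))                      # step 2: most similar prototype
        d = np.abs(coords - coords[p])
        d = np.minimum(d, size - d)                   # wrap-around (toroidal) offsets
        h = np.exp(-(d ** 2).sum(axis=1) / sigma ** 2)  # eq. (4)
        target = R[:, g.astype(bool)].max(axis=1)     # eq. (10): F^(MAX)(R, g)
        W += eps * h[:, None] * (target - W)          # step 3: eq. (8)
    return W

# Toy run (hypothetical data): 4 terms, two gene products, a 4x4 torus.
R_toy = np.array([
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.6],
    [0.0, 0.1, 0.6, 1.0],
])
G_toy = np.array([[1, 1, 0, 0], [0, 0, 1, 1]], dtype=float)
W_toy = train_osom(G_toy, R_toy, rows=4, cols=4, t_max=500)
```

Because every update is a convex combination of the old weights and a target in [0, 1], the weights remain valid memberships throughout training. The default arguments mirror the parameter values quoted above for the 20x20 network.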
Fig. 2. Toroidal grid SOM network.
Fig. 3. OSOM network mapping of the $GPD194_{12.10.03}$ data set: AVG GO similarity measure, eq. (5), and MAX update operator, eq. (10).
A. Gene Cluster Visualization
The visualization method we propose is composed of two distinct steps. First, the gene products are mapped to the trained OSOM network by the nearest prototype rule: for each gene product $\vec{g}$, find the best-match prototype node with $p = \arg\max_i S(\vec{w}_i, \vec{g})$, then annotate the prototype node $p$ with the gene product information of $\vec{g}$. This groups similar gene products into clusters, where the points in each cluster are associated with one nearest OSOM prototype node. Second, the similarity between neighboring OSOM prototype nodes is mapped into a gray-scale image, with white showing high similarity and black showing very low similarity [4]. Fig. 3 illustrates this mapping using $S^{(AVG)}$, eq. (5), and $F^{(MAX)}$, eq. (10). The white regions correspond to groups of similar gene product clusters, while the black regions show the boundaries between regions that are dissimilar. Please note that due to the toroidal topology of the OSOM network, the top and bottom, as well as the sides, wrap around.

The similarity between nodes is calculated by an average operator,
$$S^{(OSOM)}(\vec{w}_i, \vec{w}_j) = \frac{\vec{w}_i^T R \vec{w}_j}{N_T^2}, \quad (11)$$
and this similarity is calculated between each node of the OSOM network in the up-down, left-right, and four diagonal directions. Thus, each prototype node has eight surrounding pixels which correspond to its relation to neighboring nodes. The gray-scale colormap is set such that white corresponds to $\max_{\forall i, \forall j} S^{(OSOM)}(\vec{w}_i, \vec{w}_j)$ and black corresponds to $\min_{\forall i, \forall j} S^{(OSOM)}(\vec{w}_i, \vec{w}_j)$ for a given network. The color at the node locations is interpolated from the eight surrounding pixels.

As a result of this method of coloring the OSOM map, regions that are lightly colored represent groups of similar gene products, while darker regions signify outliers or gene products that are dissimilar to the surrounding groups. In addition, the degree of similarity can be seen in the intensity of the regions. For example, in Fig. 3 the light region on the right is a highly similar group, while the grayer regions signify similarity to a lesser degree, and the black regions denote boundaries between dissimilar groups of gene products.

The three GPD194 families can be seen in Fig. 3 as light-colored islands. The collagen alpha chains are located in the top-left and bottom-left (recall that the grid is toroidal; hence, these two regions are actually connected). The myotubularins are located at the top-right and bottom-right. Lastly, the receptor precursors, which are the most tightly grouped gene products (they are mapped to a bright region), are located at the right-middle of the image.

Figure 4 shows the associated gene products of the four OSOM nodes at coordinates (6,17), (6,18), (6,19), and (6,20) from the example shown in Figure 3. This zoomed-in view of the OSOM map clearly shows that the top two gene product profiles and the bottom gene product profile are all similar, as they are located in a connected light-colored region. However, the other pictured gene product profile (gene product indices 120-121) is not similar to the others, as indicated by the surrounding dark-colored region. The gene product group {120,121} has been shown to be an outlier in our previous research using this data set, and we have found corroborating evidence which supports our claim that there is substructure within the collagen family [21]. Furthermore, Fig. 3 contains a similar dark region at location (10,3). The gene products mapped to this region are indexed 30 and 107. These two gene products were found to have annotation errors in our previous research, and were later corrected in March 2004.

Fig. 4. Zoomed-in view of the OSOM network mapping of $GPD194_{12.10.03}$ as displayed in Fig. 3.

B. Functional Summarization of Gene Product Clusters
Functional summarization of the gene product profiles is achieved by examining the OSOM prototype weight vectors. The ontological content of each OSOM prototype is represented by a weight vector, as discussed in section III. Each element of the weight vector can be viewed as the influence of a specific GO annotation in defining the profile of its associated OSOM prototype. Thus, high values in a weight vector signify a high likelihood that the gene products mapped to that location in the OSOM are annotated by that specific term or a term that is very similar, according to the specified term-based similarity measure. We define the most representative term (MRT) of a gene product profile as the term that has the highest associated weight in the OSOM prototype weight vector.

Fig. 4 shows the MRTs for a zoomed-in portion of the trained OSOM network. Fig. 5(a) is a plot of the weight vector associated with the OSOM prototype at location (6,20). The term indexed 15 is the MRT for the group of gene products mapped to this OSOM prototype. The MRT for this group is GO:0005201, which is defined as extracellular matrix structural constituent. This is also the MRT for the OSOM prototype at location (6,18). In comparison, Fig. 5(b) shows the weight vector values for the OSOM prototype located at (6,18). The weights indexed 1-40 have a very similar structure in both profiles, while there is a vast difference among the higher indexed weights. This is a strength of our visualization method, as these two groups of gene products are quite similar in many regards, as evidenced by the OSOM visualization image; however, these groups are mapped to different locations due to minor differences in their ontological data. Table II outlines the MRTs for the entire trained OSOM network as shown in Figs. 3 and 4.

Fig. 5. Associated weight vector values of OSOM prototypes. (a) OSOM prototype (6,20). (b) OSOM prototype (6,18).
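The two post-training analyses of Sections III-A and III-B can be sketched as follows: the neighbor similarity of eq. (11) evaluated in the eight grid directions with wrap-around, its min-max scaling to gray levels, and the MRT lookup as an arg max over each weight vector. The function names, the row-major node ordering, and the 2x2 toy network are assumptions made for illustration; the interpolation of the node pixels themselves is not shown.

```python
import numpy as np

# The eight neighbor directions: up-down, left-right, and four diagonals.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def neighbor_similarity(W, R, rows, cols):
    """Eq. (11) between every node and its 8 toroidal neighbors.

    Returns an array of shape (rows, cols, 8), one slice per entry of OFFSETS.
    """
    n_terms = R.shape[0]
    grid = W.reshape(rows, cols, n_terms)      # row-major node layout (assumed)
    grid_r = grid @ R                          # w_i^T R for every node
    out = np.empty((rows, cols, len(OFFSETS)))
    for k, (dr, dc) in enumerate(OFFSETS):
        # np.roll wraps around the edges, which matches the toroidal grid.
        neighbor = np.roll(grid, shift=(-dr, -dc), axis=(0, 1))
        out[:, :, k] = (grid_r * neighbor).sum(axis=2) / n_terms ** 2
    return out

def to_gray(sims):
    """Scale similarities so the minimum maps to 0 (black) and the maximum to 1 (white)."""
    return (sims - sims.min()) / (sims.max() - sims.min())

def most_representative_terms(W, term_ids):
    """MRT of each prototype: the term with the largest weight."""
    return [term_ids[k] for k in W.argmax(axis=1)]

# Toy 2x2 network with two terms (hypothetical weights; identity similarity).
R = np.eye(2)
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
term_ids = ["GO:0005201", "GO:0006470"]        # labels used only as examples

sims = neighbor_similarity(W, R, rows=2, cols=2)
gray = to_gray(sims)
mrts = most_representative_terms(W, term_ids)
```

In the toy network, node (0,0) is identical to its vertical neighbor and orthogonal to its horizontal neighbor, so the corresponding pixels come out white and black respectively.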
TABLE II
MOST REPRESENTATIVE TERMS.
OSOM Index    GO ID         GO Definition
(17,1)        GO:0006470    Protein amino acid dephosphorylation
(2,3)         GO:0005201    Extracellular matrix structural constituent
(10,3)        GO:0016301    Kinase activity
(16,4)        GO:0006470    Protein amino acid dephosphorylation
(1,5)         GO:0005201    Extracellular matrix structural constituent
(17,7)        GO:0004872    Receptor activity
(16,8)        GO:0004713    Protein-tyrosine kinase activity
(19,10)       GO:0005524    ATP binding
(13,15)       GO:0006470    Protein amino acid dephosphorylation
(10,16)       GO:0005201    Extracellular matrix structural constituent
(6,17)        GO:0005201    Extracellular matrix structural constituent
(6,18)        GO:0005201    Extracellular matrix structural constituent
(14,18)       GO:0006470    Protein amino acid dephosphorylation
(6,19)        GO:0007155    Cell adhesion
(6,20)        GO:0005587    Collagen type IV
IV. CONCLUSIONS
This paper has presented a novel extension to the self-organizing map (SOM), which we call the ontological self-organizing map (OSOM). The OSOM is used to produce visualizations of ontological data in a manner that can suggest how the data groups together. A method of producing the most representative term (MRT) of each group is also presented. In the case of gene product data, the MRT can be considered a functional summarization of the grouped gene products.

We applied the OSOM-based methods of visualization and functional summarization to the $GPD194_{12.10.03}$ data set, which is composed of 194 human gene products. The OSOM produced visualization information that supports the existence of three main groups of gene products and of sub-structures within these groups. The OSOM was also effective in identifying the two groups of gene products which did not seem to belong. These two groups, namely gene products {120,121} and {30,107}, are known to be outliers within the cluster structure of this data set. Although we tailored the examples in this paper to gene product visualization and summarization, the methods presented could be applied to any ontological data for which a term-based similarity measure can be developed.

A. Future Work
In the future, we will utilize the OSOM-based methods to analyze larger sets of test problems and will apply the OSOM to other data types, such as web documents, medical records, or periodicals. We are also currently working on a number of improvements, including three-dimensional interactive visualization and summarization, network prototype grid sub-sampling, and automated creation of hierarchical trees composed of summarizing annotations. These tools will allow data-miners to visualize ontological data in multiple and, perhaps, synergistic ways.

ACKNOWLEDGMENT
Timothy Havens would like to thank his Mom and Dad for their loving support.
REFERENCES
[1] T. Kohonen, “Self-organized formation of topologically correct feature maps,” Biological Cybernetics, vol. 43, pp. 59–69, 1982.
[2] T. Kohonen, “Self-organizing maps,” Proc. IEEE, vol. 78, no. 9, pp. 1464–1480, September 1990.
[3] T. Kohonen, Self-Organizing Maps, ser. Information Sciences. Berlin: Springer, 2001, vol. 30.
[4] S. Kaski and T. Kohonen, “Exploratory data analysis by the self-organizing map: Structures of welfare and poverty in the world,” in Neural Networks in Financial Engineering. Proceedings, Third International Conference on Neural Networks in the Capital Markets, P. N. Refenes, Y. Abu-Mostafa, J. Moody, and A. Weigend, Eds. London, England: World Scientific, Singapore, 1996, pp. 498–507.
[5] J. Keller, M. Popescu, and J. Mitchell, “Taxonomy-based soft similarity measures in bioinformatics,” in Proc. IEEE Int. Conf. on Fuzzy Systems. Budapest, Hungary: IEEE, July 2004, pp. 23–30.
[6] M. Popescu, J. Keller, and J. Mitchell, “Fuzzy measures on the Gene Ontology for gene product similarity,” IEEE Trans. on Computational Biology and Bioinformatics, vol. 3, no. 3, pp. 263–274, 2006.
[7] S. Altschul, T. Madden, A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. Lipman, “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs,” Nucleic Acids Res., vol. 25, pp. 3389–3402, 1997.
[8] J. Stalker, B. Gibbins, P. Meidl, J. Smith, W. Spooner, H. Hotz, and A. Cox, “The Ensembl web site: Mechanics of a genome browser,” Genome Res., vol. 14, no. 5, pp. 951–955, May 2004.
[9] A. Enright, S. VanDongen, and C. Ouzounis, “An efficient algorithm for large-scale detection of protein families,” Nucleic Acids Res., vol. 30, no. 7, pp. 1575–1584, 2002.
[10] J. Quackenbush, “Computational analysis of microarray data,” Nat. Rev. Genet., vol. 2, no. 6, pp. 418–427, 2001.
[11] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. Lander, and T. Golub, “Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation,” Proc. Natl. Acad. Sci., vol. 96, no. 6, pp. 2907–2912, March 1999.
[12] M. Brameier and C. Wiuf, “Co-clustering and visualization of gene expression data and gene ontology terms for saccharomyces cerevisiae using self-organizing maps,” J. of Biomedical Informatics, vol. 40, no. 2, pp. 160–173, April 2007.
[13] F. Harary, Graph Theory. Reading, MA: Addison-Wesley, 2004.
[14] M. Pham, D. Bernhard, G. Diallo, R. Messai, and M. Simonet, Data-mining with ontologies: Implementations, findings, and frameworks. Harrisburg, PA: IGI Global, 2007, ch. SOM-based clustering of multilingual documents using an ontology, pp. 65–83.
[15] The Gene Ontology Consortium, “The Gene Ontology (GO) database and informatics resource,” Nucleic Acids Res., vol. 32, pp. D258–D261, 2004.
[16] S. Raychaudhuri and R. Altman, “A literature-based method for assessing the functional coherence of a gene group,” Bioinformatics, vol. 19, no. 3, 2003.
[17] J. Keller, J. Bezdek, M. Popescu, N. Pal, J. Mitchell, and J. Huband, “Gene ontology similarity measures based on linear order statistics,” Int. J. on Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 14, no. 6, pp. 639–661, 2006.
[18] P. Lord, R. Stevens, A. Brass, and C. Goble, “Semantic similarity measure as a tool for exploring the gene ontology,” Pacific Symposium on Biocomputing, pp. 601–612, 2003.
[19] J. Jiang and D. Conrath, “Semantic similarity based on corpus statistics and lexical ontology,” Proc. of Int. Conf. Res. on Comp. Linguistics X, 1997.
[20] G. J. Klir and B. Yuan, Fuzzy Sets and Fuzzy Logic: Theory and Applications. Upper Saddle River, New Jersey: Prentice Hall, 1995.
[21] J. Myllyharju and K. Kivirikko, “Collagens, modifying enzymes, and their mutation in humans, flies, and worms,” Trends in Genetics, vol. 20, no. 1, pp. 33–43, 2004.