Genome Informatics 13: 163–172 (2002)
Folding Pattern Recognition in Proteins Using Spectral Analysis Methods
Carlos A. Del Carpio-Muñoz
Julio Cesar Carbajal L.
[email protected]
[email protected]
Laboratory for Bioinformatics, Department of Ecological Engineering, Toyohashi University of Technology, Tempaku, Toyohashi 441-8580, Japan
Abstract Sequence divergence through evolution prevents homology-based protein folding prediction methods that rely on sequence alignment from detecting structural and folding similarities between distantly related proteins. Homolog coverage of current databases also plays a critical role in the performance of those methods. Both limitations are conspicuous in the so-called twilight zone of sequence homology, in which proteins with a high degree of similarity in both biological function and structure share an amino acid sequence homology of only about 20% to less than 30%. In contrast to these methods, we propose a strategy based on a different concept of sequence homology, derived from a periodicity analysis of the physicochemical properties of the residues constituting the protein primary structure. The analysis uses a front-end processing technique from automatic speech recognition to compute the cepstrum (a measure of the periodic wiggliness of a frequency response), which leads to a spectral envelope that depicts the subtle periodicity in the physicochemical characteristics of the sequence. Sequence homology is then derived by aligning spectral envelopes, so that proteins sharing common folding patterns and biological function but low sequence homology can be detected by their similarity in the spectral dimension. Applied to protein folding recognition, the methodology in many cases outperforms other methodologies in the twilight zone.
Keywords: protein structure determination, spectral analysis, cepstrum, proteomics, protein folding
1 Introduction
Incomplete homolog coverage, a natural consequence of sequence divergence through evolution, is the major factor limiting the success of long-established methods for protein structure modeling based on sequence alignments [23] and other methodologies [2, 24, 27, 28, 18, 13, 1, 10, 9, 15, 17, 14, 22]. The problems that remain, particularly with widely used sequence alignment techniques, are associated with the length of the amino acid sequence, the unambiguous differentiation of domains within the sequence, the increased number of gaps needed to optimize the matching of residues, and other factors. These problems become conspicuous in the so-called twilight zone of protein structure prediction, in which proteins with a high degree of similarity in both structure and biological function have poor similarity in amino acid sequence, the homology among them generally ranging from about 20% to less than 30% [8]. Recently, in contrast to those sequence-alignment-based methodologies, a strategy based on periodicity analysis of the physicochemical properties of the residues constituting the sequences has been proposed by the authors [7] and by other researchers [5, 3, 12], and has been applied to protein secondary structure [11] and function elucidation [5, 12, 3, 4]. The common denominator of these methodologies is a spectral analysis of the profiles of physicochemical properties of the residues in the sequences. Here we describe this new methodology, present some results obtained with it and submitted to the CASP4 [30] contest and the ongoing CASP5 [31] competition, and discuss
the capabilities of the methodology as well as its potential as a tool for genome-wide analysis and proteomics. The underlying characteristic of the methodology is that it relates the above-mentioned intrinsic, though subtle, periodicity of the physicochemical characteristics of the sequence of residues to the folding characteristics of its three-dimensional structure. We use the classification of protein primary structures into families and super-families in the SCOP database [20]. We have found that common folding patterns exist for proteins possessing similar spectral characteristics over a set of physicochemical parameters, which we call the dominant coefficients for a particular super-family. The methodology is effective at recognizing folding characteristics of protein structures in the twilight zone [8]. Using the sets of dominant coefficients for each type of folding, an unknown sequence can be automatically assigned to its putative class, family, and super-family with high probability. This constitutes the basis of a new fully automated system for protein folding recognition; a detailed description of the methodology and of the results obtained is given in what follows [6].
2 Methodology
2.1 Spectral Representation of Protein Primary Structures
We adopted a well-known front-end processing technique from robust automatic speech recognition (ASR), the objective of which is to preserve critical linguistic information while suppressing irrelevant information such as speaker-specific characteristics, channel characteristics, and noise [21]. This analysis-synthesis technique is based on the transformation of a signal into its cepstrum, which is a measure of the periodic wiggliness of a frequency response plot. The cepstrum is calculated as the inverse Fourier transform of the logarithm of the power spectrum of a signal; the resulting logarithmic periodogram yields a spectral envelope, a smooth curve obtained by connecting the main local peaks of the fine structure of the frequency spectrum. Applied to the profile of physicochemical features of a protein sequence, the technique extracts information in the form of the spectral envelope, which is used to model the relationship between the primary and tertiary structures of a protein. Comparison of two sequences is then reduced to an alignment of the spectral envelopes representing their primary structures. Once the profile of physicochemical characteristics has been obtained, it is converted to the frequency domain by applying a Fourier transform. For a sequence of N amino acids whose physicochemical profile is represented by $x_n$, the discrete Fourier transform (DFT) $X_k$ is then computed by: N −1 X 2πkn (0