Data Mining Techniques in Bioinformatics

PLENARY PAPERS

N. G. Zagoruiko*, N. A. Kolchanov**, A. G. Pichueva*, O. A. Kutnenko*, I. A. Borisova*, A. V. Kochetov**, V. A. Ivanisenko**, S. V. Nikolaev**, V. A. Likhoshvai**, and A. V. Ratushnyi**

* Institute of Mathematics, Siberian Division, Russian Academy of Sciences, pr. Akademika Koptyuga 4, Novosibirsk, 630090 Russia
e-mail: [email protected]
** Institute of Cytology and Genetics, Siberian Division, Russian Academy of Sciences, pr. Akademika Koptyuga 4, Novosibirsk, 630090 Russia

Abstract: Data Mining techniques for automatic detection of empirical regularities are described in relation to the tasks of classification, pattern recognition, and forecasting. The peculiarities of this type of problem are considered in three examples corresponding to different levels of organization of molecular-genetic systems (mRNA, protein, genetic network): (1) the prediction of a quantitative level of mRNA translational activity by analyzing the contextual characteristics of their functional regions, (2) the recognition of cleavage sites in amino acid sequences, and (3) the recognition of the type of genetic network mutation by analyzing the curves of concentration of the produced substances.

INTRODUCTION

Data Mining techniques are used for the automatic detection of empirical regularities in tasks of classification, recognition, and forecasting [1]. A peculiarity of these techniques is that they are oriented toward tasks where traditional statistical methods do not work. This especially concerns tasks that require analyzing a large body of data or ill-conditioned tables (where the number of features is comparable with the number of objects) that are corrupted by noise and omissions, with features measured on scales of different types and with no grounds for advancing hypotheses about distribution laws. Many tasks in bioinformatics exhibit these peculiarities.

Among the most widespread applied tasks of data analysis, the following types can be distinguished [1]: automatic classification (taxonomy), choice of a system of informative features, pattern recognition, and filling of omissions in data tables. The peculiarities of these tasks are examined below using three examples corresponding to different levels of organization of molecular-genetic systems (mRNA, protein, genetic network).

This work was partly supported by the Russian Foundation for Basic Research (project nos. 01-07-90376, 02-04-48508, 02-0790355, 00-04-49229, and 02-01-00082), by the Ministry of Industry, Science, and Technologies of the Russian Federation (project no. 43.073.1.1.1501), by the Siberian Division of the Russian Academy of Sciences (integration project no. 65), by the US National Institutes of Health (grant no. 2 R01-HG-0153904A2), by the US Department of Energy (grant no. 535228 CFDA 81.049), and by the program “Integration” (project no. 274).

Received March 26, 2003

Pattern Recognition and Image Analysis, Vol. 13, No. 4, 2003, pp. 550–555. Original Text Copyright © 2003 by Pattern Recognition and Image Analysis. English Translation Copyright © 2003 by MAIK “Nauka /Interperiodica” (Russia).

1. FORECASTING OF THE QUANTITATIVE LEVEL OF MRNA TRANSLATION ACTIVITY: ZET ALGORITHM

It is known that contextual characteristics of different regions of genes influence the translational activity of mRNA [2]; however, the mechanisms of this phenomenon have not been completely investigated. This paper presents the results of computer analysis of the relations between the expression level and the contextual characteristics of different functional regions of yeast genes (5'NTS, promoter, CDS, 3'NTS). As contextual characteristics, we used the nucleotide composition (mononucleotide frequencies, ratios between the frequencies of complementary nucleotides, and dinucleotide frequency deviations). The codon adaptation index (CAI) was used as a criterion reflecting the expression level of a gene. Our objective is to find significant contextual characteristics correlating with CAI values for each of the four regions separately and for a gene as a whole.

For this purpose, the ZET algorithm [1] is used. The algorithm makes it possible to forecast the values of elements omitted in a data table of the “object–property” type and to edit (check) the entire table or its parts.

The ZET algorithm is based on three hypotheses. The first (the hypothesis of redundancy) is that real tables have some redundancy, i.e., they contain similar objects (lines) and mutually dependent properties (columns). The second (the hypothesis of local compactness) is that, in order to forecast the omitted element b(ij), one should use not the entire table but only its “competent” part, consisting of the elements of lines similar to line i and the elements of columns similar to column j. The rest of the lines and columns are not informative for the given element. The third (the hypothesis of linear regularities) is that, of all possible types of dependencies, the ZET algorithm uses only linear regularities between columns (lines) (see Fig. 1). If the regularities are more complicated, their reliable detection requires a large amount of data, which is seldom available in real tasks.

Fig. 1. Filling of the omitted element b(ij). [Figure: fragment of the “object–property” table with the column x(j) (CAI) and columns x(k) (A_LDR, E_SCORE, TT5_OE), in which the omitted element b(ij) and the elements b(ik), b(lj), and b(lk) are marked.]

The operation of the ZET algorithm includes three stages. At the first stage, for the given omission, a submatrix of “competent” lines is chosen from the initial “object–property” matrix, whose columns are normalized by dispersion (variance), and then the “competent” columns are chosen for these lines. At the second stage, the parameters of the forecasting formula are chosen automatically. At the third stage, the element is forecasted with the help of this formula.

The competence of the l-th line with respect to the i-th line is the value L(il) = r(il)*t(il). Here, r(il) = 1 – ρ(il), where ρ(il) is the Euclidean distance between the i-th and l-th lines, and t(il) is the “assemblage” coefficient, equal to the number of properties whose values are known for both lines. A competent line must not have an omission in the j-th column.

The competence of the k-th column with respect to the j-th column is the value L(jk) = r(jk)*t(jk). Here, r(jk) is the absolute value of the correlation coefficient between the j-th and k-th columns, and t(jk) is the assemblage coefficient, equal to the number of objects for which both the j-th and k-th properties are known. A competent column must not have an omission in the i-th line.

At the user's command, the software chooses a competent submatrix of any size from 2 × 2 up to n × m. Usually, the submatrix contains from three to seven lines and columns.
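The two competence measures defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, omissions are encoded as NaN, the table is assumed to be normalized already, and clipping r(il) at zero for distant lines is an added assumption, since the text does not say how ρ(il) is scaled.

```python
import numpy as np

def row_competence(table, i, l):
    """Competence L(il) = r(il) * t(il) of line l with respect to line i."""
    shared = ~np.isnan(table[i]) & ~np.isnan(table[l])  # properties known in both lines
    t = int(shared.sum())                                # "assemblage" coefficient t(il)
    if t == 0:
        return 0.0
    rho = np.linalg.norm(table[i, shared] - table[l, shared])  # Euclidean distance
    r = max(0.0, 1.0 - rho)  # assumption: distant lines get zero competence
    return r * t

def column_competence(table, j, k):
    """Competence L(jk) = |correlation(j, k)| * t(jk) of column k for column j."""
    shared = ~np.isnan(table[:, j]) & ~np.isnan(table[:, k])  # objects known in both
    t = int(shared.sum())
    if t < 2:
        return 0.0
    x, y = table[shared, j], table[shared, k]
    if np.std(x) == 0 or np.std(y) == 0:
        return 0.0  # correlation is undefined for a constant column
    return abs(np.corrcoef(x, y)[0, 1]) * t

def competent_lines(table, i, j, size=3):
    """Indices of the `size` most competent lines for line i with column j known."""
    candidates = [l for l in range(table.shape[0])
                  if l != i and not np.isnan(table[l, j])]
    ranked = sorted(candidates, key=lambda l: row_competence(table, i, l), reverse=True)
    return ranked[:size]
```

The competent columns can be ranked in the same way with `column_competence`, restricted to columns that are known in line i.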
During forecasting of the omission value with the help of a dependency between the j-th column and the other columns (k-th columns), the prompts b(k) are formed. For this purpose, an equation of linear regression between the j-th column and the k-th column is used. If there are (q + 1) columns in a submatrix, the q obtained prompts are averaged with a weight proportional to the competence of the corresponding column. As a result, the forecasted value b(q) generated by redundancy in the columns is formed:

$$ b(q) = \frac{\sum_{k=1}^{q} b(k)\, L(jk)^{\alpha}}{\sum_{k=1}^{q} L(jk)^{\alpha}}. \qquad (1) $$

Here, α is a coefficient controlling the influence of competence on the result of the forecast. If α is small, differences in competence have little effect; if α is large, the most competent columns have much more influence than the others. Choosing α is the essence of the stage of fitting the forecasting formula: all known elements of the j-th column are forecasted for different values of α, and the value of α that gives the smallest forecast error is chosen. The forecast b(q) of the omitted element is then made according to formula (1) with the chosen α, and the minimum error δ(j) obtained while choosing α is taken as an estimate of the expected error of omission filling over columns.

The procedure of omission filling using the regularities between the i-th line and all other s lines (l-th lines; l = 1, 2, …, s) is similar to that described above and is carried out by the formula

$$ b(s) = \frac{\sum_{l=1}^{s} b(l)\, L(il)^{\alpha}}{\sum_{l=1}^{s} L(il)^{\alpha}}. \qquad (2) $$
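Formulas (1) and (2) have the same form, so one routine can serve both. The sketch below is a hedged illustration rather than the published ZET code: the function names, the grid of α values, the use of the mean absolute error, and the simplification that the competences stay fixed and the columns contain no omissions are all assumptions made for brevity.

```python
import numpy as np

def weighted_forecast(prompts, competences, alpha):
    """Competence-weighted mean of the prompts, as in formulas (1) and (2)."""
    w = np.asarray(competences, dtype=float) ** alpha
    return float(np.sum(np.asarray(prompts, dtype=float) * w) / np.sum(w))

def regression_prompt(x_known, y_known, x_new):
    """Prompt b(k): least-squares line fitted on known pairs, evaluated at x_new."""
    slope, intercept = np.polyfit(x_known, y_known, 1)
    return slope * x_new + intercept

def choose_alpha(target, predictors, competences, alphas=(0.5, 1, 2, 4, 8)):
    """Forecast every known element of the target column from the other elements
    and keep the alpha with the smallest mean absolute error (the estimate delta)."""
    target = np.asarray(target, dtype=float)
    predictors = [np.asarray(p, dtype=float) for p in predictors]
    n = len(target)
    best_alpha, best_err = None, np.inf
    for alpha in alphas:
        errors = []
        for i in range(n):
            keep = np.arange(n) != i  # hide element i, as if it were omitted
            prompts = [regression_prompt(p[keep], target[keep], p[i]) for p in predictors]
            errors.append(abs(weighted_forecast(prompts, competences, alpha) - target[i]))
        err = float(np.mean(errors))
        if err < best_err:
            best_alpha, best_err = alpha, err
    return best_alpha, best_err
```

With one predictor column that is an exact linear function of the target and one that is nearly unrelated, the search moves toward the largest α in the grid, which matches the remark that a large α lets the most competent columns dominate.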

In order to choose α, all known elements of the i-th line are used, and the value for which the error δ(i) of their forecast is minimum is chosen. The general forecast b'(ij) of the omitted element b(ij) is obtained by averaging the two prompts with weights inversely proportional to the expected errors:

$$ b'(ij) = \left\{ \frac{b(q)}{\varepsilon + \delta(j)} + \frac{b(s)}{\varepsilon + \delta(i)} \right\} \frac{[\varepsilon + \delta(j)]\,[\varepsilon + \delta(i)]}{2\varepsilon + \delta(j) + \delta(i)}. \qquad (3) $$

Here, ε is a constant, e.g., 0.01, introduced in order to avoid division by zero.

For different applied tasks, many modifications of the described basic ZET algorithm have been made. The ZET-R algorithm is used for detecting gross errors in the initial data table (the so-called table editing mode). For this purpose, the program forecasts all elements of the table in turn and compares the results of the forecast with the real data. In this way, gross errors or intentional distortions of some elements of the data table can be found. In addition, in this mode, we can determine how strong the connection is between the forecasted (desired) characteristic and the characteristics included in the competent submatrix: the stronger the connection, the more accurate the forecast of the desired characteristic (see Fig. 2). It is also important to take into account how often a given characteristic was included in the competent submatrix. The more often it was competent, the more important it is for the desired characteristic.

Fig. 2. Forecasted vs. initial values of CAI parameters. [Plot; horizontal axis: initial data; vertical axis: forecasted values.]

This mode of table editing was used for estimating the relative importance of contextual characteristics for the translational activity (CAI) of yeast genes. As a result of the analysis [3], it was found that the accuracy with which the activity of different genes is forecasted from their characteristics varies (see Fig. 3). For 30% of genes, the forecast error is less than 20%, and for 40% of genes the error is in the range from 20% to 50%. There are genes (30%) whose activity is forecasted with an error of more than 50%. For this group of genes, it is necessary to use additional characteristics.

Fig. 3. Number of elements whose actual forecast error falls within the respective range (up to 20%, 20–50%, 50–80%, 80–100%, more than 100%).

The importance of different segments of a gene was estimated, and, for every segment, the importance of each of its contextual characteristics was also estimated. The percentage of cases in which a given segment provided the minimum actual error in activity forecasting is 22% for 5'NTS, 55% for CDS, 19% for PROM, and 3% for 3'NTS. The CDS segment characteristics give the best results. However, other segments can also be used for forecasting the gene expression level.

Thus, on the basis of the ZET algorithm, a method was obtained for predicting the gene expression level from the characteristics of its functional segments and for estimating the dependence of activity on different contextual characteristics of a gene.

2. RECOGNITION OF THE TYPE OF MUTATION IMPAIRMENT IN GENETIC NETWORKS: RTM ALGORITHM

Recently developed experimental techniques (laboratories-on-a-chip) automatically provide kinetic characteristics of the functioning of hundreds and thousands of genes and their products in cells. A topical task in analyzing these data is to recognize the type of mutation impairment in genetic networks. By solving this problem, we can develop procedures for diagnosing diseases caused by disturbances in the operation of genetic networks, create medicines with a strictly specialized action on given molecular-genetic and biochemical processes occurring in cells, and so on.

In this work, the genetic network controlling erythroid cell differentiation under the action of erythropoietin was investigated. Using the model of this network, developed in the framework of the generalized chemical-kinetic procedure [4], information was obtained about changes in the concentrations of 34 different substances involved in biochemical reactions. They were monitored for 100 hours. The result for every state of the network was tabulated into 34 columns (different substances) and 100 lines (different moments of time).

Fig. 4. Dynamics of a change in the concentrations of heme (1), receptors bound to transferrin at the surface (2), and GATA-1 mRNA (3): normal (a) and with different mutations (b). [Panels 1a, 1b, 2a, 2b, 3a, 3b; vertical axis: molecules/cell; horizontal axis: hours.]

The mutations disturbing the work of a certain link of the genetic network were simulated by changing some parameter values of the model. In total, 19 different mutations were simulated, and ten variations were generated for every mutation. The observed kinetics of “mutant erythrocyte” maturation, together with the calculations for the normal system, were used as a learning sample. It was necessary to suggest a recognition procedure for assigning a state of the network with an unknown deviation from the norm to one of the 19 types of mutation.

The problem of mutation type recognition can be solved on the basis of the pairwise recognition principle [5]. Usually, a large number of patterns are recognized in a feature space common to all patterns. However, it is evident that, for reliable recognition within the pattern pairs (A, B), (A, C), and (B, C), it would be reasonable to use characteristics individually adjusted for every pair, so that the patterns in each pair differ maximally. For example, in speech recognition of the words “sixteen,” “fifteen,” and “fifty,” for the first pair it is reasonable to use characteristics connected with the beginnings of these words and for the second pair, with their endings. If K patterns need to be recognized, this approach requires the construction of templates for each pairwise combination and then recognition by an efficient procedure of pairwise comparison of competing patterns.

In this particular task, the general idea was modified as follows. For every substance w at the moment t, its informativeness J(w, t) was determined as the number n of different pairs of mutations from the learning sample all of whose available variations can be accurately distinguished by examining the concentration of the given substance at the given moment of time only. Taking this informativeness into account, an approximate algorithm was used to choose a minimal collection of substances and, for each substance, time intervals of minimal length sufficient to correctly distinguish all variations of all mutations from the learning sample.

The decision rule is a list that assigns to every pair of compared pattern templates its informative substances and, to each substance, the moments of time that ensure accurate recognition of all variations of all mutations from the learning sample. For each mutation to be recognized, these data make it possible to determine whether it is a variation of one of the two examined types of mutation or not. At every step, this question was solved for the current pair. The process was organized in such a way that, within a number of steps not exceeding the number of mutations, one of the following results was obtained: either the mutation type of the control sample was determined or it was established that the control realization did not belong to any of the given types of mutation.

The operation of the algorithm revealed that the construction of a decision rule for recognition of all 19 types of mutations requires information about the concentrations of only 3 of the 34 substances: heme, receptors bound to transferrin at the surface, and GATA-1 mRNA (see Fig. 4). Out of the whole period of monitoring, it was sufficient to use data obtained in time intervals of 11, 23, and 2 hours only. For the chosen characteristics, all control mutations were accurately recognized.

The same mathematical model of the genetic network was used to extract information on the genetic network behavior for 9 single mutations and the 36 double mutations formed by all pairs of single mutations. Based on a combination of different measures of closeness obtained in the entire feature space, the recognition program provided a list of the most probable single mutations contained in the double mutations. Correct solutions were obtained in 89% of cases [6].

The use of the pattern recognition technique reveals a great redundancy of information in the dynamic characteristics of genetic networks. The feasibility of recognizing single mutations of an arbitrary type, and double mutations from the manifestations typical of single mutations, was shown. This indicates a way to create procedures for diagnosing diseases caused by impairments in genetic networks.

Fig. 5. Symbol positions in the analysis window. [Window with L = 8; positions –3, –1, 1, and 3 are marked.]

3. RECOGNITION OF PROCESSING SITES OF SIGNAL PEPTIDES: ADDEL ALGORITHM

The automatic recognition of signal peptides and cleavage sites in proteins is a topical task both for recognizing the intracellular localization of proteins and for applied problems in medicine and biotechnology. In this paper, the application of pattern recognition techniques is considered, and the possibility of using physicochemical characteristics of amino acids for solving the problem is investigated.

The learning sample is represented by sets of different protein fragments (EUK, ECOLI, HUMAN, GRAM+, GRAM–): fragments containing cleavage sites (“Signal peptide” pattern), fragments containing anchors (“Anchor” pattern), and fragments of nuclear and cytoplasmic proteins containing neither sites nor anchors (“Negative” pattern). The aim is to recognize to which of the three mentioned patterns an unknown protein belongs.
For the proteins of the first pattern, it is necessary to indicate the most probable location of the cleavage site. In the investigations, 10 Kidera properties [7] and 434 structural and physicochemical properties of amino acids [8] were used. Thus, a 444-dimensional vector of properties corresponded to each of the 20 elements of the amino acid alphabet [9].

At the first stage, we examined how the quality of detection and location of cleavage sites depends on the width of the analysis window, i.e., on the number of examined symbols to the left and to the right of the point of cleavage. The window width varied from 6 to 36 symbols. The recognition was performed by the cross-validation method. The “k nearest neighbors” rule with k = 1 was used as a decision rule. Simultaneously, a “hypothesis of parity” was used. It is known that neighboring amino acids have opposite spatial directions. Amino acids at every second position have the same orientation and can together take part in certain biochemical processes. Therefore, the cleavage function we are interested in can act differently at even and odd positions of amino acids in the analysis window. Experiments show that the best results are obtained for the window with L = 8 (see Fig. 5), wherein only the elements with odd numbers were used, which is in complete agreement with the known rule “–3, –1.”

Then, the informative features were chosen from among the 444 physicochemical characteristics of the amino acids. A window L symbols wide moved along the protein chain from left to right with a shift of one symbol, and for each window, a decision was made as to which of the two patterns, “Signal peptide” or “Negative,” it contained. Seven learning samples were formed, containing the same set of 253 elements of the first pattern and 7 different sets of 253 elements of the second pattern. Data that were not used in the learning process were used for control recognition. The features were chosen using the AdDel algorithm [1], which combines the ideas of the methods of “sequential addition of the most important” (Addition) and “sequential deletion of the least important” (Deletion) features. It turns out that the best results are obtained when the number of features is sufficiently small: from 7 to 30 features. Together, the seven collections contained 92 of the 444 features.
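The idea of alternating additions and deletions can be sketched as follows. The text gives only the general principle of AdDel, so everything specific here is an assumption made for illustration: the function names, the step sizes `n_add` and `n_del`, the stopping limit, and the use of the leave-one-out accuracy of the 1-nearest-neighbor rule as the quality criterion.

```python
import numpy as np

def loo_1nn_accuracy(X, y, features):
    """Leave-one-out accuracy of the 1-nearest-neighbor rule on the given features."""
    Z = X[:, features]
    d = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)  # pairwise distances
    np.fill_diagonal(d, np.inf)  # an object may not be its own neighbor
    return float(np.mean(y[np.argmin(d, axis=1)] == y))

def add_del(X, y, n_add=3, n_del=1, max_features=6):
    """Alternate n_add greedy additions with n_del greedy deletions and
    return the best feature subset seen, together with its accuracy."""
    assert n_add > n_del, "each cycle must enlarge the subset"
    limit = min(max_features, X.shape[1])
    selected, best, best_score = [], [], -1.0
    while True:
        for _ in range(n_add):  # Addition: take the most useful remaining feature
            rest = [f for f in range(X.shape[1]) if f not in selected]
            if not rest or len(selected) >= limit:
                break
            selected.append(max(rest, key=lambda f: loo_1nn_accuracy(X, y, selected + [f])))
        score = loo_1nn_accuracy(X, y, selected)
        if score > best_score:
            best, best_score = list(selected), score
        if len(selected) >= limit:
            return best, best_score
        for _ in range(n_del):  # Deletion: drop the feature whose removal hurts least
            drop = max(selected, key=lambda f: loo_1nn_accuracy(
                X, y, [g for g in selected if g != f]))
            selected.remove(drop)
        score = loo_1nn_accuracy(X, y, selected)
        if score > best_score:
            best, best_score = list(selected), score
```

In this sketch, `X` would hold one row per window and one column per physicochemical characteristic, and `y` the pattern labels. Running it separately on each learning sample would give one feature collection per sample, in the spirit of the seven collections mentioned above.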
The features of the Kidera set did not show any advantage: only one of them entered one collection. Two examples of sets of informative features are given below:

(1) BEGF750101, BROC820102, CHOP760104, CHOP780206, CHOP780208, CHOP780214, CHOP780216, DAYM780101, DESM900102, FAUJ830101, FAUJ880101, GEIM800105, GEIM800110, JANJ780103, JUNJ780101, KANM800102, KANM800103, KYTJ820101, NAKH900111, PRAM900101, QIAN880111, QIAN880114, TANS770102, VASM830101, WARP780101, AURR980102, AURR980120.

(2) BIGC670101, CHAM830104, FASG760102, GARJ730101, JANJ790101, LEVM760105, MCMT640101, ZASB820101, Kidera8.

When the control sequences were recognized, a window 18 symbols long moved along the protein chain from left to right with a shift of one symbol. Thus, 8940 segments were separated on 252 control

proteins containing cleavage sites. There were many more proteins without cleavage sites; therefore, 296177 segments were chosen for the second pattern. For every position of the window, a decision was made on the presence of a cleavage site. The two patterns were recognized in parallel over the seven feature sets mentioned above. Decisions in favor of the first or second pattern were made by the majority of seven votes. Of the 252 segments of the first pattern, 213 (84.5%) were recognized correctly. Among the segments without cleavage sites, 234569 segments (79.2%) were correctly assigned to the second pattern. The average reliability of recognition and cleavage site localization is thus taken to be 82%.

In actual practice, not only the decision “Yes” or “No” is of interest but also an estimate of the probability that a cleavage site does or does not occur in every window. Such an estimate can be obtained with the help of a modified “k nearest neighbors” rule. For every control object y, let us find the distances r1 and r2 to the two nearest neighbors, one per pattern. The sum of these distances is R = r1 + r2. The membership function for the first pattern (“Yes”) is f = 1 – 2*r1/R; it varies in the range from +1 to –1. If f > 0, the object y belongs to the first pattern; otherwise, it belongs to the second. In our case, for the given control sample, such functions were obtained over the seven feature sets, so the total value of the membership function F was obtained by averaging the seven partial solutions: F = (f1 + f2 + … + f7)/7. By using the same experimental material and averaging the partial solutions, we obtained an average reliability of recognition of the two patterns equal to 85%. Averaged values of the membership function for the two patterns are shown in Figs. 6 and 7.

Fig. 6. Membership function for fragments containing cleavage sites.

Fig. 7. Membership function for fragments containing no cleavage sites.

The developed technique was applied in order to recognize the three above-mentioned patterns. For this purpose, the method of pairwise comparison described above was used. In addition to physicochemical properties, information on the occurrence of amino acids in positions –3 and –1 of the analysis window was used. Such additional information increases the recognition quality. Testing on a large volume of experimental data showed that cleavage sites were correctly detected and localized in 85% of cases [9].

CONCLUSIONS

The obtained results indicate the efficiency of the intelligent analysis technique (Data Mining) in solving a wide range of problems in bioinformatics aimed at investigating the patterns of the structural–functional organization of molecular-genetic systems.

REFERENCES

1. Zagoruiko, N.G., Prikladnye metody analiza dannykh i znanii (Applied Methods of Data and Knowledge Analysis), Novosibirsk: IM, 1999.
2. Kochetov, A.V., Ishchenko, I.V., Vorobiev, D.G., Kel, A.E., Babenko, V.N., Kisselev, L.L., and Kolchanov, N.A., Eukaryotic mRNAs Encoding Abundant and Scarce Proteins are Statistically Dissimilar in Many Structural Features, FEBS Lett., 1998, vol. 440, pp. 351–355.
3. Pichueva, A.G., Kochetov, A.V., and Zagoruiko, N.G., Study of the Relation Between Expression Level and Contextual Characteristics of Yeast Gene Functional Regions by the ZET Method, Proc. 2nd Int. Conf. on Bioinformatics of Genome Regulation and Structure, Novosibirsk, Russia, 2000, vol. 3, pp. 58–61.
4. Ratushny, A.V., Podkolodnaya, O.A., Ananko, E.A., and Likhoshvai, V.A., Mathematical Model of Erythroid Cell Differentiation Regulation, Proc. 2nd Int. Conf. on Bioinformatics of Genome Regulation and Structure, Novosibirsk, Russia, 2000, vol. 1, pp. 203–206.
5. Zagoruiko, N.G., Pattern Recognition by the Method of Pair-wise Comparison of Standards in the Competent Features Subspaces, Dokl. Ross. Akad. Nauk, 2002, vol. 382, no. 1, pp. 24–26.
6. Borisova, I.A., Zagoruiko, N.G., Likhoshvai, V.A., Ratushny, A.V., and Kolchanov, N.A., Diagnostics of Mutations Based on Analysis of Gene Networks, Proc. 2nd Int. Conf. on Bioinformatics of Genome Regulation and Structure, Novosibirsk, Russia, 2000, vol. 2, pp. 163–165.
7. Kidera, A., Konishi, Y., et al., Statistical Analysis of the Physical Properties of the 20 Naturally Occurring Amino Acids, J. Prot. Chem., 1985, no. 4, pp. 23–55.
8. Database of Structural and Physicochemical Properties of the Amino Acids; available at http://www.cbs.dtu.dk/services/SignalP.
9. Zagoruiko, N.G., Kutnenko, O.A., Nikolaev, S.V., and Ivanisenko, V.A., Recognition and Localization of Cleavage Site in Signal Peptides, Proc. 2nd Int. Conf. on Bioinformatics of Genome Regulation and Structure, Novosibirsk, Russia, 2000, vol. 3, pp. 104–107.
