Quantitative sequence-activity models (QSAM)--tools for sequence ...

Nucleic Acids Research, 1993, Vol. 21, No. 3 733-739

Quantitative sequence-activity models (QSAM)--tools for sequence design

Jorgen Jonsson, Torbjorn Norberg, Lena Carlsson1, Claes Gustafsson1 and Svante Wold

Research Group for Chemometrics, Department of Organic Chemistry and 1Department of Microbiology, University of Umeå, S-901 87 Umeå, Sweden

Received May 18, 1992; Revised and Accepted December 29, 1992

ABSTRACT

Models have been developed that allow the biological activity of a DNA segment to be altered in a desired direction. Partial least squares projections to latent structures (PLS) was used to establish a quantitative model between a numerical description of 68 bp fragments of 25 E. coli promoters and their corresponding quantitative measure of in vivo strength. This quantitative sequence-activity model (QSAM) was used to generate two 68 bp fragments predicted to be more potent promoters than any of those on which the model originally was based. The optimized structures were experimentally verified to be strong promoters in vivo.

INTRODUCTION

We are here concerned with the relation between the composition of a DNA sequence and its associated biological activity. The analysis of sequence data has traditionally been concentrated on qualitative pattern recognition (i.e. classification). This involves, for example, models that are based on the observed similarity (i.e. homology) between sequences (1-5). Homology based models have also been used in attempts to model the magnitude of functional properties of sequences (6). Such models have, however, been criticized for being of limited predictive value (7, 8).

Here we aim to demonstrate that sequence data may carry two complementary pieces of information. The first part is the homology, i.e. information related to absence of variation. The second, less well recognized information is that conveyed by systematic variation. For quantitative correlations in a class of related sequences, the information based on systematic variance must also be extracted and utilized.

The reason that potential co-variance structures are usually not considered is that the potential of multivariate methods like e.g. principal components analysis (PCA), partial least squares projections to latent structures (PLS) and neural networks (NN), for sequence modelling purposes has not been recognized until recently (9-11).
NN have successfully been used to classify digitized DNA sequences, see e.g. (12-14). The applicability of PLS to quantitative sequence-activity modelling (QSAM) has been addressed by us in earlier papers (10, 14-16). It is thus important to distinguish between QSAM and classical sequence pattern recognition modelling (see refs. above). In the present context the QSAM is developed within a class of functionally related sequences. The objective is to delineate the relationship between the sequences and the magnitude of their corresponding biological activity. The aim of this paper is, however, not primarily to design a strong bacterial promoter. The objective is rather to outline a general strategy whereby the relationship between the composition of a biopolymer and its biological activity may be quantitatively described and subsequently utilized. It should be noted that models of the category presented here are local linearizations of the more complex functions underlying the biological phenomena observed. Consequently, QSAMs are of local validity, i.e. interpretations and predictions relate to the experimental conditions used to characterize the set of sequences upon which the models are based.

Parametrization of DNA sequence

A numerical representation of the sequence is a prerequisite for quantitative modelling. One possibility is to use a qualitative (discrete) parametrization of the monomers. This corresponds to the use of indicator variables that unequivocally and symmetrically state the identities of the relevant monomers. This implies that a minimum of three descriptor variables/base must be used in order to obtain an informationally efficient representation for DNA (see refs. 10 and 12). Another alternative is given by quantitative monomer parameters, i.e. continuous variables, such as principal properties (PPs) derived from measured physico-chemical data collected for monomers of interest (15, 16). These two kinds of descriptors have different advantages. The qualitative indicator variables are conceptually simpler, easier to derive and more readily interpreted. Properly derived quantitative descriptors may, however, enable an interpretation in terms of which physico-chemical factors are important for the biological response, how they combine, etc. In this example we have utilized four discrete indicator variables to represent the bases of DNA (A = 1000, C = 0100, G = 0010 and T = 0001). The reason for this selection is that the resulting model parameters are the least complicated to interpret.

Promoter sequence data

Prokaryotic transcriptional promoters are specific DNA sequences that are recognized by the σ-unit of the RNA polymerase holoenzyme (RNAP). The assembled enzyme complex
subsequently initiates and carries out the transcription of mRNA from the DNA template. There are many examples of sequences known to be functional E. coli promoters. Some originate from the bacterium itself, others from infecting phages. Few of these promoters have been consistently characterized with regard to their in vivo promoter strength. However, a system that allows this efficiency parameter to be determined relative to an internal standard has been developed and used to characterize promoters by Bujard and coworkers (5, 6, 15-17). In this assay the strength of the test promoter (in front of the dihydrofolate reductase gene, dhfr) is expressed relative to that of the promoter for β-lactamase (Pbla), which is present on the same plasmid. Monitoring the mRNA expressed from the promoter under study in relation to the standard permits the relative promoter efficiency to be determined unbiased by translational effects or gene dosage. These data were considered to be comparatively well suited for QSAM development for two reasons: a) the material comprises a relatively large set of structures that are multipositionally altered, and is thereby informationally better suited than similar sets generated using saturation mutagenesis, and b) the experimental protocol, comprising both an internal standard and external references, gives better comparability for additional structures.
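The four-variable indicator coding described under "Parametrization of DNA sequence" can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function names `encode_sequence` and `encode_alignment` and the example 6-mers are hypothetical and do not come from the paper; only the coding A = 1000, C = 0100, G = 0010, T = 0001 is taken from the text.

```python
# Indicator coding as given in the text: A=1000, C=0100, G=0010, T=0001.
BASE_CODE = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def encode_sequence(seq):
    """Return the flat indicator vector (4 variables per base) for one sequence."""
    row = []
    for base in seq.upper():
        row.extend(BASE_CODE[base])  # raises KeyError for anything but A/C/G/T
    return row


def encode_alignment(seqs):
    """Encode equally long, pre-aligned sequences into the rows of a matrix X."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    return [encode_sequence(s) for s in seqs]


# Hypothetical 6-mers, used only to show the shape of the result.
X = encode_alignment(["TTGACA", "TTGCTA"])
print(len(X), len(X[0]))  # 2 rows, 6 * 4 = 24 columns
```

With this coding a 68 bp fragment becomes a row of 68 x 4 = 272 descriptor variables, which matches the 272 regression coefficients referred to in the results.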

Analysis of sequence data is normally based on the assumption that certain positions in the sequence in some way interact with a target molecule. This, in turn, corresponds to the requirement that the sequences to be analyzed are of similar length and that they are properly aligned. Modifications of these requirements may, for example, be made by dividing the sequence into subsequences around given points of reference. However, structural descriptions that are alignment independent may also be accomplished, e.g. according to the principles outlined by van Heel (9). In this paper the principles of multivariate DNA QSAM are illustrated using the traditional alignment-dependent representation of sequence data. The present models are hence based on promoters having similar distances between the positions -35, -10 and +1; this makes each position of the 68-mer more directly comparable. From references (5, 6, 15-17), the 68 bp fragment (-49 to +19) relative to the start of transcription was compiled. All promoters having a 17 bp spacer between the -35 and -10 regions and a 7 bp spacer between the -10 and +1 regions were considered. This subset of 25 promoters was found to comprise three major categories: 1) PD/E20, PG25, PJ5 and PN25 from phage T5, 2) PL from phage lambda and 3) Pcon, an artificial construct originally synthesized by Dobrynin et al. (18). The structures of promoters

Table 1. Promoter, strength (log Pbla-units) and promoter sequence (from position -49)
1 PD, 2 P02, 3 PJS, 4 PN25, 5 PN2503, 6 PN2 504, 7 PN2505, 8 PN5px,
9 PN253/DSR, 10 PN25A", 11 P

... K and that all the variables (K) are independent (i.e. orthogonal). PLS (21, 25) is used to correlate a single dependent variable y (or a matrix Y) to the variation in a predictor matrix X. PLS is a generalization of PCA where the components of X (t_a) are calculated so that they approximate X well and correlate well with y. Since PLS is a projection method, it can handle collinear data having many more variables (sequence descriptors, K) than objects (sequences, N), as long as the resulting components (A) are few compared to N. The result is a stable approximation of the correlation between X and y. The statistical significance of PLS models is also determined by cross-validation. In this paper PLS is used to relate the promoter efficiency variable (y) to the systematic variation in the promoter sequence matrix (X, the 25 parametrized 68-mers). The interpreted model is subsequently used to generate suggestions of sequences containing the essence of the structural features characteristic of strong promoters.

Choice of model

This QSAM was attempted using both PLS and a heteroassociative back-propagation NN; the results obtained were similar. However, we only present the results from the PLS QSAM, since this method was found to be advantageous for a number of reasons: 1) PLS is more robust, since it does not require the proper initial setting of numerous variables (e.g. the number and size of hidden layer(s), choice of weight function(s), epoch length, etc.), which is required for the NN, 2) PLS converges to the stable solution in a matter of seconds rather than hours, 3) PLS proceeds in a fashion that allows the statistical significance of the model to be evaluated simultaneously, so that the risk of overfitting the model to the data is minimized while the predictive capabilities are optimized, and 4) the PLS weights are more readily interpreted, both in terms of which monomer is optimal in each position and, if quantitative descriptors are used, in terms of the physico-chemical reason(s) why a certain monomer is preferred. The interpretation of the NN QSAM was not equally straightforward, irrespective of the sequence descriptors used.
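The way PLS extracts components of X that both approximate X and correlate with y can be sketched for a single response with the textbook NIPALS PLS1 algorithm. This is a minimal sketch under assumptions, not the software used in the paper: the names `fit_pls1` and `predict_pls1` and the toy data are hypothetical, the data are only mean-centred (no autoscaling, which would divide by zero for constant indicator columns), and cross-validation is omitted.

```python
from math import sqrt


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fit_pls1(X, y, n_components, tol=1e-12):
    """Fit a single-response PLS model by NIPALS on mean-centred data."""
    n, k = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(k)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(k)] for row in X]  # X residuals
    f = [v - y_mean for v in y]                                # y residuals
    components = []
    for _ in range(n_components):
        # Weights: direction in X-space with maximal covariance with y.
        w = [dot([E[i][j] for i in range(n)], f) for j in range(k)]
        norm = sqrt(dot(w, w))
        if norm < tol:  # nothing left in X that co-varies with y
            break
        w = [v / norm for v in w]
        t = [dot(E[i], w) for i in range(n)]                   # scores t_a
        tt = dot(t, t)
        p = [dot([E[i][j] for i in range(n)], t) / tt for j in range(k)]
        q = dot(f, t) / tt                                     # y loading
        # Deflate: remove what this component explains from X and y.
        E = [[E[i][j] - t[i] * p[j] for j in range(k)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        components.append((w, p, q))
    return {"x_mean": x_mean, "y_mean": y_mean, "components": components}


def predict_pls1(model, x):
    """Predict y for one descriptor vector by repeating the deflation steps."""
    e = [v - m for v, m in zip(x, model["x_mean"])]
    y_hat = model["y_mean"]
    for w, p, q in model["components"]:
        t = dot(e, w)
        y_hat += q * t
        e = [ev - t * pv for ev, pv in zip(e, p)]
    return y_hat


# Hypothetical toy data with y = 2*x1 + 3*x2 (not promoter data).
X_toy = [[1, 0], [0, 1], [1, 1], [0, 0], [2, 1]]
y_toy = [2, 3, 5, 0, 7]
model = fit_pls1(X_toy, y_toy, n_components=2)
prediction = predict_pls1(model, [3, 2])
```

In the toy case the number of components equals the rank of X, so the model reproduces the exact linear relation; with sequence data, where K greatly exceeds N, the number of components would be kept small and chosen by cross-validation, as the text describes.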

Results of the PLS QSAM

Two PLS dimensions, significant according to cross-validation, accounted for 27.5% of the systematic variance of X and explained 85% of the variance in the promoter efficiency variable (y). The first PLS dimension alone used 11.5% of X to explain 73% of y; the second added a further 16% of X and 12% of y. These results refer to autoscaled data (i.e. each variable is scaled to unit variance); the use of unscaled data, however, gives similar results. The PLS weights were subsequently transformed into PLS regression coefficients. The relative size of these coefficients over different positions indicates their relative importance to the promoter strength. To display the influence of each position, we have in Fig. 1 summed the absolute values for each group of four coefficients and corrected this number for the degrees of freedom (DOF) for each position. From Fig. 1 it can be seen that positions -35 to -33, -11 and +1 are constant for all 25 promoters in this example. These positions are probably important to the promoter strength, although the magnitude of their influence cannot be assessed from the present data. Among the varying positions we see that position -12 is the most important, followed by positions 4 to 14 in the downstream region, positions around -8 to -10 and position

$$\text{Position influence} = \frac{\sum_{n=1}^{4} |c_n|}{\mathrm{DOF}} \times 10^{4}$$

where $c_n$ are the PLS regression coefficients and DOF is the degrees of freedom for each position (the number of bases actually occurring in a particular position minus one). The correction for DOF gives a relatively larger importance to the more conserved positions.
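The position influence defined above can be computed directly from the regression coefficients and the indicator-coded training matrix. The sketch below is illustrative only: the function name `position_influence`, the toy matrix and the toy coefficients are hypothetical, and the scale factor of 10^4 is an assumed reading of the printed formula. Positions where only one base occurs have DOF = 0; they are returned as `None`, in line with the statement that the influence of constant positions cannot be assessed.

```python
def position_influence(coefs, X, scale=1e4):
    """Sum of |c_n| over the four coefficients of a position, divided by DOF.

    coefs: flat list of regression coefficients, four per position.
    X:     indicator-coded training matrix (list of rows, four columns per position).
    scale: assumed constant factor (10^4).
    """
    n_pos = len(coefs) // 4
    influence = []
    for pos in range(n_pos):
        cols = range(4 * pos, 4 * pos + 4)
        # Number of different bases actually occurring at this position.
        n_bases = sum(1 for j in cols if any(row[j] for row in X))
        dof = n_bases - 1
        if dof == 0:
            influence.append(None)  # constant position: influence not assessable
        else:
            influence.append(scale * sum(abs(coefs[j]) for j in cols) / dof)
    return influence


# Hypothetical two-position example: position 0 is constant (always A),
# position 1 varies over A, C and G.
X_demo = [
    [1, 0, 0, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0, 1, 0],
]
coefs_demo = [0.0, 0.0, 0.0, 0.0, 0.5, -0.25, 0.25, 0.0]
result = position_influence(coefs_demo, X_demo)
print(result)  # [None, 5000.0]
```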

Figure 2. PLS correlation plot for the two-dimensional model, showing the promoter strength calculated from the QSAM for the 25 training set objects plotted versus the corresponding literature data. The two structures suggested, PLS1 and PLS2, are included in the plot, using the predicted values on both axes. These are predicted to be the most potent promoters. Promoter strengths are given in logarithmic bla-units; numbers correspond to those in Table 1.

-38. The relatively large influence of the downstream region seems to corroborate the observation that in vivo promoter strength is dependent on more than one functional parameter (15). The importance of this region may reflect the contributions to promoter strength from ease of initiation and/or the speed of RNAP promoter clearance.

The 272 PLS regression coefficients were then examined in detail, in groups of four corresponding to the descriptors for each of the 68 positions. For each of these groups the largest positive value indicates the 'preferred' base of that particular position. An entirely synthetic sequence was thus generated by selecting, for each position, the base most favourable with respect to the model. This was done for both the one- and the two-dimensional PLS models. The result was two strength-optimized 68 bp sequences denoted PLS1 and PLS2. For the homologous and close to homologous positions in the -35, -10 and +1 regions, the PLS sequences are determined by the bases having descriptors matching the corresponding column averages (x̄). PLS1 and PLS2, from the one- and two-dimensional QSAM respectively, were subsequently parametrized and reinserted into the model and their in vivo promoter strength was predicted, see Table 1 and Fig. 2. Promoter strength predictions from the NN QSAM were similar (data not shown).

To visualise the sequence characteristics of these theoretical promoters in relation to those of the training set, all sequences were analyzed by PCA. The result was a significant two-component model describing a total of 30% (19 and 11% respectively) of the systematic variance in the composition of the sequences. The scores from the two components are plotted in Fig. 3. The promoters of the training set are seen to be clustered according to their origin (phage T5, phage lambda and constructs). The strength-optimized constructs from the QSAM are positioned separately at the lower part of the projection.
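The base-selection step used to generate PLS1 and PLS2 (take, for every group of four coefficients, the base with the largest coefficient, and fall back on the column averages at conserved positions) can be sketched as follows. The function name `design_sequence`, the `conserved` argument and the toy numbers are hypothetical, and positions are plain 0-based indices into the fragment rather than promoter coordinates.

```python
BASES = "ACGT"  # column order of the four indicator variables per position


def design_sequence(coefs, column_means, conserved=()):
    """Pick one base per position from grouped regression coefficients.

    coefs:        flat list of regression coefficients, four per position.
    column_means: column averages of the indicator-coded training matrix.
    conserved:    0-based positions where the column averages decide instead.
    """
    seq = []
    for pos in range(len(coefs) // 4):
        lo, hi = 4 * pos, 4 * pos + 4
        scores = column_means[lo:hi] if pos in conserved else coefs[lo:hi]
        best = max(range(4), key=lambda i: scores[i])  # first maximum on ties
        seq.append(BASES[best])
    return "".join(seq)


# Hypothetical three-position example; position 1 is conserved (always T).
coefs_demo = [0.1, -0.2, 0.3, 0.0,
              0.0, 0.0, 0.0, 0.0,
              -0.1, -0.5, -0.2, -0.05]
means_demo = [0.25, 0.25, 0.25, 0.25,
              0.0, 0.0, 0.0, 1.0,
              0.5, 0.5, 0.0, 0.0]
designed = design_sequence(coefs_demo, means_demo, conserved={1})
print(designed)  # GTT
```

The designed sequence can then be indicator-coded again and passed through the fitted model to obtain its predicted strength, which is how the predictions for PLS1 and PLS2 are described in the text.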
The four most potent training set sequences, PL/N25DSR, PLcon/N25DSR, PL/N25USR and PD/E20, are all situated in this lower region of the projection. The QSAM has thus pointed us further in the direction of strong promoters, outside the region defined by the original 25

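Figure 1. The influence on in vivo promoter strength (influence on y) of each position (-49 to +19) in the considered 68 bp fragment, displayed as the position influence, i.e. the DOF-corrected sum of the absolute PLS regression coefficients for that position.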

Figure 3. Score plot from the PCA on all 27 structures. The first component (t1) describes 19% and the second (t2) 11% of the systematics in the composition of the sequences. Numbers refer to those in Table 1.

sequences. Also in this analysis, the PC loadings show that the region downstream of +1 is the main determinant of the patterns observed (data not shown). To validate the predictive capabilities of this QSAM, it was subsequently decided to synthesize the promoters suggested by the model and determine their relative in vivo promoter strengths. An external reference set of six promoters ranging from weak to strong was kindly provided to us by H. Bujard and co-workers. However, since the reference promoters were found to be of different lengths, it was decided to include three versions of the reference promoters (PA,, PD/E20 and PL