TRANSFAC Retrieval Program: A Network Model Database of Eukaryotic Transcription Regulating Sequences and Proteins

R. KNÜPPEL, P. DIETZE, W. LEHNBERG, K. FRECH,1 and E. WINGENDER

ABSTRACT

DNA sequences that are involved in the control of gene expression in eukaryotes have been collected together with the proteins that bind to and act through them (TRANSFAC data set). To make these data accessible, the TRANSFAC retrieval program (TRP) has been developed as a database management system based on the network model. This database model offers particular advantages for managing data of complex structure. The aim of TRP is to provide an easily handled statistical basis for a computational approach to transcription control.

Key words: TRANSFAC; network model database; transcription factors

INTRODUCTION

The question of how biological information processing is realized is one of the central problems of biology. It is of particular importance in eukaryotes because it is connected to problems of developmental control of an organism, morphogenesis and cell differentiation, tissue specificity, hormonal communication, and cellular stress responses, e.g., toward heat or viral infections. The flow of the relevant biological information is controlled at various levels, among which transcriptional control appears to be one of the most important (Wingender, 1990, 1993). It is exerted by distinct DNA elements that are generally located in the 5' region of a gene, i.e., within its promoter or enhancer region, although enhancers may also be located downstream of the 3' end of the transcribed part of the gene, within its introns, or even within coding sequences. These elements act through specific interactions with proteins (transcription factors) that stimulate or repress the initiation of transcription by RNA polymerase. However, the binding specificity of transcription factors is rather relaxed; it is therefore very difficult to predict factor binding on the basis of sequence information alone, a task that is obviously mandatory for interpreting the data that will emerge from genome research projects.

There is a rapidly growing body of literature concerning the regulation of individual genes, the sequence elements that govern this regulation, and the proteins that bind to them and thereby mediate their function. Even for experts working in the field, it becomes more and more difficult to maintain an overview of the mechanisms that control gene expression and of the cis- and trans-acting components involved. There were some early attempts to compile regulatory DNA sequences and the corresponding transcription factors as they appear in the genome (Wingender, 1988) or in the form of canonical and/or consensus sequences derived from them (Johnson and McKnight, 1989; Latchman, 1990; Locker and Buzard, 1990; Faisst and Meyer, 1992). Later, these data collections were greatly extended and transferred into computer-readable format (Ghosh, 1990, 1991, 1992; Wingender et al., 1991). In this paper, we present the TRANSFAC database, which contains an updated data set of previously published tables (Wingender, 1988) and manages data access using a network database model.

Gesellschaft für Biotechnologische Forschung, Department of Genetics, Braunschweig, Germany.
1 AG BIODV, Institut für Säugetiergenetik, GSF-Forschungszentrum für Umwelt und Gesundheit mbH, Ingolstädter Landstrasse 1, D-85758 Oberschleißheim, Germany.

SYSTEM AND METHODS

Database models

A database model is the general description of the way in which data and their relationships are stored and processed in a database system. The most important database models are the hierarchical model, the network model, and the relational model. As the hierarchical model is only a special case of the network model, it will not be discussed here. Although different terms are used, all database systems store data in database records consisting of related fields that contain string or numeric data. They differ in the way relationships are modeled and operations on the data are specified.

The network database model was developed by the Conference on Data Systems Languages (CODASYL) Data Base Task Group (DBTG) (CODASYL, 1978). The key concept for modeling relationships is the 'set,' which is defined by two records, specified as owner and member, respectively, and a sorting order. One entry of the owner record may be connected to an arbitrary number of member record entries by one set, whereas one member record entry is connected to exactly one owner record entry. Thus, each set models a one-to-many relation. Many-to-many relations between two records are modeled by defining both of them as owner records and connecting them to a common member record by two sets; this common member record may hold further data that characterize the connection of the respective owner entries (Fig. 1).
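The set concept described above can be illustrated with a minimal Python sketch. It is not TRP or Raima Data Manager code; the class `Record`, the helper functions, the set names `site_links` and `ref_links`, and all entry values are assumptions made up for illustration. Each set is one-to-many (a member entry has exactly one owner per set), and a many-to-many relation is obtained by giving a common member ("link") entry two owners through two sets.

```python
class Record:
    """One record entry. `sets` holds owned members, `owners` holds owner links."""

    def __init__(self, kind, data):
        self.kind = kind
        self.data = data
        self.sets = {}    # owner side: set name -> list of member entries
        self.owners = {}  # member side: set name -> the single owner entry


def connect(set_name, owner, member):
    """Attach `member` to `owner` through the named set (one-to-many)."""
    if set_name in member.owners:
        raise ValueError(f"member already has an owner in set {set_name!r}")
    owner.sets.setdefault(set_name, []).append(member)
    member.owners[set_name] = owner


def link_many_to_many(set_a, owner_a, set_b, owner_b, link_data=None):
    """Join two owner entries through one common member entry."""
    link = Record("link", dict(link_data or {}))
    connect(set_a, owner_a, link)
    connect(set_b, owner_b, link)
    return link


def related(owner, own_set, other_set):
    """Walk down own_set to the link entries, then up other_set to the other owners."""
    return [link.owners[other_set] for link in owner.sets.get(own_set, [])]


# Hypothetical entries, for illustration only.
site_1 = Record("site", {"id": "SITE1"})
site_2 = Record("site", {"id": "SITE2"})
ref_1 = Record("ref", {"rcode": "REF1"})
ref_2 = Record("ref", {"rcode": "REF2"})

first_link = link_many_to_many("site_links", site_1, "ref_links", ref_1)
link_many_to_many("site_links", site_1, "ref_links", ref_2)
link_many_to_many("site_links", site_2, "ref_links", ref_1)

print([r.data["rcode"] for r in related(site_1, "site_links", "ref_links")])
print([s.data["id"] for s in related(ref_1, "ref_links", "site_links")])
```

In this sketch the link entry plays the role of the common member record: it can carry its own data through `link_data`, which corresponds to the "further data that characterize the connection" mentioned in the text.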

[Figure 1: schema diagram. The legend distinguishes one-to-one relations, one-to-many relations (owner record to member record), and many-to-many relations (owner record 1 and owner record 2 joined through a common member record). Legible record and set labels in the diagram include transfac, facref, tfref, meth, tfmeth, tfcell, claref, matref, class, syn, matrix, cell, embl-link, swiss-link, and pir-link.]

FIG. 1. Data structure of the network-model TRANSFAC database as it is realized in the TRANSFAC retrieval program (TRP). The owner records are depicted in rectangles, and common member records required to establish many-to-many relations between them are indicated by circles. The sets that connect the owner and the common member records by one-to-many relations are represented by arrows and are appropriately labeled.

The relational database model, which is based on the theory of relational algebra, was developed by Codd (1970). From the user's point of view, a relational database is simply a finite number of two-dimensional tables. The tables are connected through common data fields (columns of the table). Operations to be performed on the database are expressed in terms of relational algebra (e.g., projection or join). TRANSFAC will contain large amounts of data, so the performance of data access is an important feature of the database. Because the network database model largely reduces redundant data and allows fast data access along sets, TRANSFAC has been implemented as a network database.
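For comparison, the two relational operations named above can be sketched on small in-memory tables. This is only an illustration of projection and join; the table contents, column names, and function names are hypothetical and are not taken from TRANSFAC.

```python
def project(table, columns):
    """Relational projection: keep the named columns and drop duplicate rows."""
    result = []
    for row in table:
        reduced = {c: row[c] for c in columns}
        if reduced not in result:
            result.append(reduced)
    return result


def join(left, right, column):
    """Join two tables on one common column (plain nested loop, no index)."""
    return [{**l, **r} for l in left for r in right if l[column] == r[column]]


# Hypothetical tables connected through the common field "factor_id".
sites = [
    {"site_id": "S1", "factor_id": "F1"},
    {"site_id": "S2", "factor_id": "F1"},
    {"site_id": "S3", "factor_id": "F2"},
]
factors = [
    {"factor_id": "F1", "factor": "factor A"},
    {"factor_id": "F2", "factor": "factor B"},
]

joined = join(sites, factors, "factor_id")
names = project(joined, ["factor"])
print(joined)
print(names)
```

Note that the value of `factor_id` is stored in both tables; this is the kind of duplicated connecting field that the network model replaces with a direct link.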

Software used

The acceptance of software depends on its user friendliness; windows, mouse control, buttons, and clicks are standard nowadays. Furthermore, the success of a program depends on its portability to different computer (processor) platforms. The TRANSFAC retrieval program (TRP) is therefore written in C, a widely standardized programming language. C-scape from Liant Software Corporation was used as the user interface management system, and Raima Data Manager from Raima Corporation was used as the database management system. The software products generated with these tools are license-free and may be freely distributed, e.g., by anonymous file transfer protocol (ftp). They will be available for different platforms (DOS, Unix, VMS). The user interface has the typical characteristics of a CUA (Common User Access) window; it runs in text and graphics mode on DOS and as an X-Window/Motif application on Unix and VMS. Screens are written into a separate screen file so that they can be modified easily. The database kernel supports the development of multiuser network and relational databases and provides an SQL interface. The DataDesigner from ESM Software GmbH (Nürtingen, Germany), the distributor of the two above-mentioned tools, combines both. Database functions and their manipulation via the user interface are defined by ASCII macro files. During program development, changes can be made by modifying the ASCII files without the need for recompilation. Because the database functions are separated from the user interface, the program can easily be adapted to new environments, e.g., Windows NT or OS/2.

Availability

The TRANSFAC retrieval program was developed on a PC and has been tested under DOS and VMS; it is available for these two operating systems. The DOS version is compiled as 16-bit code and will run in 640 KB in text mode only. A Unix version (Ultrix on a DEC RISC workstation) will be compiled in the near future. ASCII files containing the data can be distributed on demand. The program package can be distributed on any medium or via anonymous ftp (ftp.gbf-braunschweig.de, 193.175.244.2). Distribution with the EMBL CD-ROM is planned.

RESULTS AND DISCUSSION

Structure of previous data collection

In preceding contributions, we described the data that are relevant for gene regulating processes (Wingender, 1988, 1994; Wingender et al., 1991). First of all, transcriptional regulation is exerted by distinct sequence elements of generally 4-25 base pairs, which are clustered to constitute the promoter or enhancer region of a gene. These cis elements are sites bound by positively or negatively trans-acting factors (transcription factors). These factors exert their effect on the basal transcription machinery around the initiation site through a complex variety of protein-protein interactions. Thus, the database has to contain information about regulatory elements in individual genes (i.e., their location), their DNA sequences, and the factors binding to these sequences. Also, the species from which the gene has been obtained has to be given, as well as the source of the binding factor, the method by which a particular interaction has been identified, and the corresponding references. This was the information content of the previous Table 1 (Wingender, 1988), which has now been supplemented by a denomination of the individual elements, by the biological classification of the species from which the gene has been derived, by the corresponding EMBL database identifier and accession number, and by free comments which, for instance, may describe physical parameters of the corresponding DNA-protein interaction.

Further information about the transcription factors was given in the previous Table 2 (Wingender, 1988): factor denominations (including synonyms) and the species from which the factor has been obtained, size, structural motifs (previously focused on zinc finger structures), indications of post-translational modifications, and references. These data sets have been enriched by further structural characteristics such as the general blueprint of the DNA-binding domain, other conserved motifs, regions that are significantly enriched in some amino acid residues, and transcription activating regions. Also given now are indications of protein-protein interactions, cell specificity, functional properties as described by, e.g., Faisst and Meyer (1992), and identifiers/accession numbers from other databases (EMBL, SwissProt, PIR).

Records

These data sets have been transferred into two records, SITES and FACTORS (Fig. 1); the fields defining them are shown in Fig. 2. In addition to the fields listed above, SITES contains an identifier and a unique accession number and indicates the type of nucleotide sequence (DNA or RNA). In contrast to earlier descriptions (Wingender, 1988; Wingender et al., 1991), however, this record itself does not contain the name of the binding factor(s), its source, the methodological indication, or the references; rather, this information is accessible through a link to the FACTORS, METHODS, or REFERENCES records, the relations to which are described below (Fig. 1). Also, the sequence(s) of any individual site are stored in a separate record. Similarly, much of the information related to the FACTORS record is deposited separately (Figs. 1 and 2). Thus, the main record SITES is flanked by the records SEQUENCES, METHODS, and CELLS, and the main record FACTORS by the records SYNONYMS, FACTOR CLASS, and INTERACTING FACTORS. FACTORS is also connected to MATRIX, which gives nucleotide distributions in aligned binding sites. FACTORS is linked with the CONSENSUS DESCRIPTION record, which contains features of the binding site as deduced using the program ConsInd (Frech et al., 1993). The record types SITES, FACTORS, FACTOR CLASS, and MATRIX are related to REFERENCES. The TRANSFAC retrieval program provides links to other programs and databases such as ConsInspector (Frech et al., 1993) and the EMBL database.

    record tf {
        unique key char id[15];      /* site identifier */
        /* accession no. (incl. sequence type flag) */
        unique key char acc[7];
        char st[2];
        key char descr[51];          /* description */
        key char tfembl[11];         /* EMBL identifier */
        key char tfembl_acc[8];      /* EMBL accession no */
        long embl_pos;               /* position of site in EMBL sequence */
        long pfrom;                  /* position from */
        long pto;                    /* position to */
        char posl[31];               /* positions relative to */
        key char element[21];        /* element */
        key char sp[51];             /* species */
        char clas[81];               /* classification */
    }

    record seq {
        long seqlen;                 /* sequence length */
        long ma_pos;                 /* first position in matrix sequence; 0 if none */
        key char sequence[71];
    }

    record fac {
        unique key char fid[11];     /* factor id */
        unique key char facc[11];    /* factor accession # */
        key char factor[21];         /* binding factor */
        char mass[41];               /* mol. mass */
        char feat[101];              /* structural features */
        char prop[101];              /* functional properties */
        key char fsp[51];            /* species */
        char fcell[101];             /* cell specificity */
        char embl[11];               /* EMBL identifier */
        char swiss[11];              /* Swissprot identifier */
        char pir[11];                /* PIR code */
        char embl_acc[8];            /* EMBL accession no */
        char swiss_acc[7];           /* Swissprot accession no */
        char pir_acc[7];             /* PIR accession */
    }

    record ref {
        unique key char rcode[11];   /* reference code */
        key char author[195];        /* authors */
        char title[195];             /* title */
        char source[51];             /* source */
        char vol[5];                 /* volume */
        char year[5];                /* year */
        char pages[17];              /* pages */
        char rename[4];              /* reference entered by */
        long redate;                 /* reference entry date */
    }

FIG. 2. Field definitions of TRANSFAC database records SITES (tf), FACTORS (fac), and REFERENCES (ref). Note that the sequence(s) of the regulatory element to be described in a SITES entry are contained within a separate record, which is linked by a one-to-many relation.
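To make the idea of a MATRIX entry concrete, the following sketch tabulates a nucleotide distribution from a set of aligned binding sites. It is a simplified, unweighted count and does not reproduce the weighting used by the ConsInd program; the function names and the four example sequences are hypothetical.

```python
BASES = "ACGT"


def count_matrix(aligned_sites):
    """Count how often each base occurs at each position of the alignment."""
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("aligned sites must all have the same length")
    matrix = [{base: 0 for base in BASES} for _ in range(length)]
    for site in aligned_sites:
        for position, base in enumerate(site.upper()):
            if base not in BASES:
                raise ValueError(f"unexpected symbol {base!r}")
            matrix[position][base] += 1
    return matrix


def most_frequent(matrix):
    """Most frequent base per position; ties go to the first base in A, C, G, T order."""
    return "".join(max(BASES, key=lambda b: column[b]) for column in matrix)


# Hypothetical aligned binding sites, for illustration only.
example_sites = ["TGACTCA", "TGAGTCA", "TGACTAA", "TGACTCA"]

matrix = count_matrix(example_sites)
for position, column in enumerate(matrix, start=1):
    print(position, column)
print(most_frequent(matrix))
```

A real matrix record would additionally need the link to the factors it belongs to and some indication of the quality of the underlying sequences, as discussed in the text.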

Relations

As indicated in Fig. 1, most of the relations defined in the TRANSFAC database are many-to-many relationships. Thus, one reference may point to several sites or factors, and one site (or factor) may cite a list of references. Since homologous factors of different species are described by distinct records but may share their synonyms, FACTORS and SYNONYMS are also connected by a many-to-many relationship. For the same reason, several FACTORS records may be assigned to one binding matrix; conversely, one factor may own several MATRIX records that differ, e.g., in the quality of the underlying sequence elements (see below). Because the possibility of linking records of the same type is limited, the interaction of two factors is implemented by creating an INTERACTIONS record that holds the accession numbers and the names of the factors involved. There are only two examples of one-to-many relationships: a single functionally defined site may comprise more than one essential sequence element, but each of these sequences is connected to only one SITES record; and each factor belongs to only one class family, whereas each class consists of several members.
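The INTERACTIONS record can be pictured as a small entry that names both partners, which avoids linking a record type directly to itself. The sketch below is an assumption about how such an entry could be represented; the field names, accession numbers, and factor names are invented for illustration and do not come from the TRANSFAC schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    """One interaction entry holding accession number and name of both factors."""
    acc_a: str
    name_a: str
    acc_b: str
    name_b: str


def partners(interactions, acc):
    """List (accession, name) of every factor that interacts with `acc`."""
    found = []
    for entry in interactions:
        if entry.acc_a == acc:
            found.append((entry.acc_b, entry.name_b))
        elif entry.acc_b == acc:
            found.append((entry.acc_a, entry.name_a))
    return found


# Hypothetical entries, for illustration only.
interactions = [
    Interaction("F0001", "factor A", "F0002", "factor B"),
    Interaction("F0003", "factor C", "F0001", "factor A"),
]

print(partners(interactions, "F0001"))
print(partners(interactions, "F0002"))
```

Because the entry stores both accession numbers, the interaction can be found starting from either partner.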

Specific features of the network model

In contrast to the common relational database systems, the network model has the advantage of establishing direct physical links between related records rather than connecting them through common fields. As a consequence, it avoids redundant storage and ensures rapid access to related data. Because we have to consider at least 50,000 genes in the human genome and possibly similar numbers in other vertebrates, and still thousands of genes in yeast, each gene being controlled by several regulatory sequence elements, we shall have to deal with a fairly large amount of data in the future. In a conventional relational database system, an increasing amount of data also increases the time necessary for data access through relationships, because one first has to search for the appropriate entry in the connected table, which takes at least O(log n) time, where n is the number of entries in this table. In network database systems, where related entries are connected by references that can be dereferenced in constant time, the time needed to access related data is independent of the number of entries in the database. In implemented systems of both models, one has to take into account the access time to external storage media, which may depend on size, but because fewer data accesses are needed it is still advantageous to apply the network model.
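The access-time argument can be illustrated by counting the probes of a binary search over a sorted key column and comparing them with following a stored reference. This is a toy model under stated assumptions: a Python `range` stands in for a sorted, indexed key column, and the class `Entry` is a hypothetical stand-in for a linked record entry. It ignores disk access, which the text notes must be considered in real systems.

```python
def binary_search_probes(sorted_keys, target):
    """Return (position, number of probes) of a binary search for `target`."""
    lo, hi, probes = 0, len(sorted_keys), 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return lo, probes


class Entry:
    """Hypothetical record entry that stores a direct reference to its owner."""

    def __init__(self, key, owner=None):
        self.key = key
        self.owner = owner


# Keyed lookup: the number of probes grows with the logarithm of the table size.
small = binary_search_probes(range(1_000), 765)
large = binary_search_probes(range(1_000_000), 765_432)
print("probes in table of 1,000 entries:    ", small[1])
print("probes in table of 1,000,000 entries:", large[1])

# Linked access: one dereference, regardless of how many entries exist.
factor = Entry("factor entry")
link = Entry("link entry", owner=factor)
print(link.owner.key)
```

The probe count is bounded by the bit length of the table size, so a thousandfold increase in entries roughly doubles the search effort, whereas the reference is followed in a single step in both cases.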

User interface

The network database defines a fixed relation of records. Whereas relational databases tend to display tables composed of fields from different records related by common fields, we focused on the record as an entity. Every record type can be called from SAA-style (Systems Application Architecture) drop-down menus. Records are displayed in screens with all their fields. Related records can be retrieved from an active screen via buttons. If the selected owner record possesses many members, a list of these members is displayed from which the one of interest can be selected (Fig. 3). Once several screens with related records have been opened, changing the entry in one screen automatically updates the other screens with the related entries. Selected fields of each record type are indexed to allow retrieval of entries from various points of view. Keywords may be entered by hand or selected from a pop-up list.

A structured query language (SQL) module can either be started independently by SQL experts and users who know the structure of the database, or be started from the menu of the main program. TRP also provides a query module that allows the user to select individual fields of SITES and FACTORS records in any combination from a Search screen and to enter the respective search criteria directly or by selecting them from pop-up lists (Fig. 4). The Boolean operators connecting the search criteria can also be chosen. The module then translates these selections into an SQL query, which can be edited further if required for more specialized applications. Similarly, the contents of individual fields can be selected for output from the Show screen. The output can be directed to the screen or a printer or can be written into an ASCII file. Retrieval schemes and output formats are defined in ASCII script files that can be modified by the user.
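The translation of Search and Show selections into a query can be sketched as follows. This is not the TRP query module: the function `build_query`, the whitelist of fields, the use of the record names `tf` and `fac` as table names, and the `?` placeholders are all assumptions, and the generated text is generic SQL rather than the dialect of the SQL interface mentioned above. The field names themselves are taken from Fig. 2.

```python
# Assumed subset of searchable fields per record (names as in Fig. 2).
ALLOWED_FIELDS = {
    "tf": {"id", "acc", "descr", "element", "sp"},
    "fac": {"fid", "facc", "factor", "fsp"},
}
ALLOWED_OPERATORS = {"AND", "OR"}


def build_query(record, show_fields, criteria, operator="AND"):
    """Build (sql_text, parameters) from Show fields and Search criteria.

    `criteria` is a list of (field, value) pairs joined by one Boolean operator.
    """
    if record not in ALLOWED_FIELDS:
        raise ValueError(f"unknown record {record!r}")
    if operator not in ALLOWED_OPERATORS:
        raise ValueError(f"unsupported operator {operator!r}")
    used = list(show_fields) + [field for field, _ in criteria]
    for field in used:
        if field not in ALLOWED_FIELDS[record]:
            raise ValueError(f"unknown field {field!r} for record {record!r}")
    sql = f"SELECT {', '.join(show_fields)} FROM {record}"
    if criteria:
        conditions = f" {operator} ".join(f"{field} = ?" for field, _ in criteria)
        sql += f" WHERE {conditions}"
    return sql, [value for _, value in criteria]


# Hypothetical selection: show id and element of sites matching either criterion.
sql, params = build_query(
    "tf",
    ["id", "element"],
    [("sp", "human"), ("element", "TATA box")],
    operator="OR",
)
print(sql)
print(params)
```

Keeping the search values separate from the query text, as done here with placeholders, also leaves the generated statement easy to edit by hand, which matches the editing step described in the text.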

FIG. 3. Screen display of the SITES, FACTORS, and CONSENSUS DESCRIPTION records of TRP. A. The SITES record basically displays information on a certain protein binding site of the gene given in the description field. The corresponding entry of the SEQUENCES record is displayed automatically. Information about the binding factors, their cellular sources (Cells), the methods by which this site has been identified, and the corresponding references can be invoked by pressing the appropriately labeled buttons. While scrolling through the database using the buttons First, Previous, Next, or Last, pop-up lists, such as that with the binding factors, can be kept open and will update their content automatically according to the SITES entry just displayed. For the example shown here, activating one factor in the pop-up list by mouse click and pressing the Show option will invoke the complete FACTORS entry of this protein. B. Similarly, the FACTORS record is connected to those records giving additional information, such as that of the binding sites of the factor shown on the main screen. C. The record CONSENSUS DESCRIPTION contains information about the consensus binding sequence of a particular factor as generated by the ConsIndex program (Frech et al., 1993). The factors whose binding sites contributed to this consensus can be listed as a pop-up table as well. The user may also display the sequences on which the consensus is based, their alignment, a graphical output of the consensus index along the aligned binding sites, and the weighted matrix that gives the nucleotide distribution values weighted for the contribution of the individual sequences to avoid statistical bias (see Frech et al., 1993, for details).

FIG. 4. Search and Show screens for complex retrieval routines. A. According to the data structure of TRP (Fig. 2), search selections can be made for the two main records SITES and FACTORS. In each case, search criteria may be entered by the user and can be combined by selectable Boolean operators in the Search screen. B. For output, individual field contents may be selected in the Show screen. Next to the names of records, arrowheads point to additional screens that display the corresponding fields for further output selections.

Future developments

Database programs handle only static data; they usually lack the possibility of processing the data that have been selected by a retrieval operation. Therefore, TRP will offer interfaces to other programs and databases. The selected data will be passed to spawned programs such as the GENMON or GCG program packages, ConsInspector, SPLIT, or programs that calculate three-dimensional DNA structures. In this way, theoretical analyses of transcriptional control can be based on a thorough study of all presently available relevant data. The consensus record already contains results from the program ConsInd (Frech et al., 1993). DNA sequence segments will be obtained from the EMBL database, e.g., via the GENMON program package.

ACKNOWLEDGMENT

The authors wish to thank Prof. John Collins for his continuous support and stimulating discussions. This work is part of a ring funding project supported by the German Federal Ministry of Science and Technology (01 IB 306 A).

REFERENCES

CODASYL. 1978. Report of the CODASYL Data Description Language Committee. Information Systems 3, 247-320.
Codd, E.F. 1970. A relational model of data for large shared data banks. Communications of the ACM 13, 377-387.
Faisst, S., and Meyer, S. 1992. Compilation of vertebrate-encoded transcription factors. Nucleic Acids Res. 20, 3-26.
Frech, K., Herrmann, G., and Werner, T. 1993. Computer-assisted prediction, classification, and delimitation of protein binding sites in nucleic acids. Nucleic Acids Res. 21, 1655-1664.
Ghosh, D. 1990. A relational database of transcription factors. Nucleic Acids Res. 18, 1749-1756.
Ghosh, D. 1991. New developments of a transcription factors database. Trends Biochem. Sci. 16, 445-447.
Ghosh, D. 1992. TFD: The transcription factors database. Nucleic Acids Res. 20, 2091-2093.
Johnson, P.F., and McKnight, S.L. 1989. Eukaryotic transcriptional regulatory proteins. Annu. Rev. Biochem. 58, 799-839.
Latchman, D.S. 1990. Eukaryotic transcription factors. Biochem. J. 270, 281-289.
Locker, J., and Buzard, G. 1990. A dictionary of transcription control sequences. DNA Sequence 1, 3-11.
Wingender, E. 1988. Compilation of transcription regulating proteins. Nucleic Acids Res. 16, 1879-1902.
Wingender, E. 1990. Transcription regulating proteins and their recognition sequences. CRC Crit. Rev. Eukaryotic Gene Expression 1, 11-48.
Wingender, E. 1993. Gene Regulation in Eukaryotes, VCH, Weinheim.
Wingender, E. 1994. Recognition of regulatory regions in genomic sequences. J. Biotechnol. 35, 273-280.
Wingender, E., Heinemeyer, T., and Lincoln, D. 1991. Regulatory DNA sequences: predictability of their function, Vol. 4, 95-108. In J. Collins and A.J. Driesel, eds., Genome Analysis - From Sequence to Function; BioTechForum. Adv. Mol. Genet., Hüthig Buch Verlag, Heidelberg.

Address reprint requests to:
Dr. E. Wingender
Gesellschaft für Biotechnologische Forschung mbH
Department of Genetics
Mascheroder Weg 1
D-38124 Braunschweig, Germany

Received for publication April 13, 1994; accepted May 3, 1994.
