
SPECIAL ISSUE

DISCOVERING THE SECRETS OF DNA

Sophisticated software tools are becoming increasingly important in helping biologists understand how nature operates. Symbolic pattern-recognition and artificial-intelligence methodologies are contributing to the development of such software.

PETER FRIEDLAND and LAURENCE H. KEDES

The fundamental quest of biological science is to understand how nature functions. The discovery of the genetic code (the language a living organism uses to produce proteins from a nucleic-acid template) was a major step toward understanding a complex and intriguing biological process. However, this important discovery was only the beginning of the exploration. In particular, the problem of control (how cells “decide” which of the many thousands of genes should be “switched on” at any given time) is central to further understanding living organisms.

Discovery occurs at two levels. The first is the primitive level, which, in the case of deoxyribonucleic acid (DNA), is the level of the four nucleotide bases that are attached to the sugar-phosphate backbone. The problem is to locate significant patterns of bases that provide control signals to the cellular protein-production machinery, a symbolic pattern-recognition task for which computers are particularly well suited. Since the total DNA of even a small virus can be several tens of thousands of bases long, discovering patterns, while often heuristic, is tedious and repetitive.

The second level of discovery involves model building and theory formation, where the task is to develop an understanding of both the local and global structures of protein-production machinery and to determine how these structures relate to the functionality of the cell. This level is of great interest to the branch of computer science known as machine learning (a subfield of artificial-intelligence (AI) research).

The MOLGEN-II project is supported by NSF grant 83.10236.

ACM/IEEE-CS Joint Issue © 1985 ACM 0001-0782/85/1100-1164 75¢

Communications of the ACM

The purpose of this article is to show how computers are currently being used to help biologists solve discovery problems on the first level and how they potentially can be used on the conceptually more difficult problems on the second level. The authors’ experience in these applications comes from their participation in the MOLGEN project, a research group begun in 1975 as a collaboration among computer scientists and molecular biologists at Stanford University. MOLGEN is a part of the Heuristic Programming Project (HPP), an organization led by Edward Feigenbaum of the computer science department with the goal of applying AI methodologies to real-world problems in science, medicine, and engineering. Current MOLGEN work focuses on model-building and theory-formation issues; previous research on experiment design led to the development of several useful tools for pattern recognition in DNA sequences.

This article is divided into three self-contained sections. The first provides a tutorial introduction to the science of molecular genetics to allow the computer-scientist reader to fully appreciate the problems of biological discovery. The second section illustrates the use of symbolic pattern-recognition tools in analyzing DNA sequences. The final section describes the MOLGEN research in scientific theory formation introduced in the previous paragraph.

November 1985 Volume 28 Number 11

BIOLOGICAL BASICS

A gene was originally defined as a purely metaphysical concept: the unit of heredity, for example, the trait for blue eyes versus brown eyes or the trait for a wrinkled versus a smooth skin on a garden pea. However, molecular biologists discovered that a gene has an actual physical representation in the chromosomes, the hereditary material of cells. Every cell has one or more chromosomes, and each cell in any multicellular organism has a duplicate set of chromosomes. Genes are small, linearly arrayed regions of the chromosomes. Although chromosomes are highly coiled polymers complexed with a large array of proteins, the core of every chromosome is a continuous polymer of DNA. The DNA polymer, made up of only four building blocks, the nucleotide bases, carries in coded form all the information necessary for inheritance of specific traits, the genes. A gene is a region of the DNA molecule that encodes such a specific trait.

The Concept of Complementarity

DNA is a double-stranded heteropolymer and can be thought of symbolically as a continuous string of the four bases. The nucleotide base elements of the deoxyribose nucleotide molecules on one of the strands interact with their counterparts on the other strand in a precise and predictable way that is actually part of the copying mechanism. An A can only interact (through hydrogen-bond formation) with a T, and a G can only interact with a C. This simple relationship, called base pairing, totally defines the sequence of a second DNA strand if the sequence of the first strand is known. Thus, if a short segment of a DNA molecule has the sequence ATATAGCTCG, then the double-stranded molecule must look like this:

---ATATAGCTCG---   Strand 1
---TATATCGAGC---   Strand 2

When DNA is replicated, or copied, a new DNA strand is synthesized using each of the old strands as templates (patterns to guide the formation of new complementary strands). But, given the rule of complementation, the sequences of the new strands are completely predictable.

old   ---ATATAGCTCG---        ---ATATAGCTCG---   new
new   ---TATATCGAGC---        ---TATATCGAGC---   old

The Physical Basis of Mutations

When the germ cells of an organism (the eggs or sperm) undergo cell division (multiplication), the DNA molecules in each cell are also copied. Although the copying mechanism that generates the new DNA strands is highly reliable, mistakes (mutations) do occur, usually at random. For example, when copying a sequence such as ATATAGCTCG one could end up with ATATACCTCG. Such a simple single-base substitution, or point mutation, can affect genetic information (but not always; sometimes such mutations have no obvious effect and are considered neutral). Bases can also be deleted or inserted; short or long DNA segments can be lost or added; DNA segments can even be inverted. Such events are also mutations and usually (but not always) cause more drastic changes in heritable characteristics than do the simpler point mutations.

In a formal sense, the smallest fundamental unit of heredity is a single nucleotide of DNA because it can alter a heritable trait. However, for the purposes of this discussion, we shall consider a gene to be a bit larger than that. We define a gene as any region of DNA that encodes a complete product or performs a specific function.

The Products of Genes

The products of the kinds of genes we are considering are proteins. Proteins are often enzymes (catalysts that convert one chemical compound to another) or are structural components of cells. An example of an enzyme would be one of the proteins in the pathway that enzymatically converts the simple amino acid tyrosine to the brown-skin pigment melanin. If the gene encoding the information to make this protein carries a mutation, then the protein will be absent or defective, and no melanin, or reduced amounts of melanin, will be made, leaving the individual with an albino skin color.
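The base-pairing rule and the point-mutation example described earlier (strand ATATAGCTCG and its miscopied form ATATACCTCG) can be checked mechanically. What follows is only a minimal sketch and not part of the original article: the names `PAIR`, `complement`, and `point_differences` are illustrative assumptions, and positions are counted from zero.

```python
# Pairing rule stated in the text: A pairs only with T, and G only with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand):
    """Return the partner strand, written in the same left-to-right order."""
    return "".join(PAIR[base] for base in strand)


def point_differences(original, copy):
    """List (position, old_base, new_base) for every single-base substitution."""
    if len(original) != len(copy):
        raise ValueError("a substitution comparison needs strands of equal length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(original, copy)) if a != b]


strand_1 = "ATATAGCTCG"
print(complement(strand_1))                        # TATATCGAGC (Strand 2 in the text)
print(point_differences(strand_1, "ATATACCTCG"))   # [(5, 'G', 'C')]
```

Deletions, insertions, and inversions change the length or order of the strand, so they would need an alignment step that this sketch deliberately leaves out.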

How Does a Gene Carry Protein Coding Information?

Genes are regions of DNA, and proteins are the products of genes. Proteins are built from a fundamental set of 20 amino acids, and DNA carries the amino-acid coding information. If a single nucleotide base of DNA coded 1 amino acid, only 4 amino acids could be accounted for because there are only 4 nucleotides in DNA. Only 16 would be accounted for by 2 nucleotides (4 x 4), whereas 64 kinds of amino acids can be determined by 3 nucleotides in sequence (4 x 4 x 4). Therefore, a code consisting of at least 3 nucleotides would be needed to specify all 20 amino acids. In fact, a group of 3 nucleotides in sequence does code for 1 amino acid, and this 3-nucleotide or triplet sequence is called a codon.

Since there are 64 possibilities for triplet combinations of four bases and there are only 20 fundamental amino acids, the code is obviously redundant and is referred to as degenerate. Several combinations of three bases are synonyms. No triplet encodes more than 1 amino acid, but many amino acids are encoded by more than one triplet, some by as many as six. In addition, three of the triplets do not correspond to any amino acid. Accordingly, when any one of these three so-called nonsense combinations occurs, it acts as a termination signal and signifies the end of the encoded protein.

Transcription and Translation

The DNA itself does not directly act as a template for the protein decoding/synthesizing machinery. What happens, in fact, is that a complementary copy of one of the two strands of DNA is synthesized out of ribose nucleotides to generate an RNA copy of the gene in a process called transcription. This RNA copy is then decoded by the protein synthesizing machinery in a process called translation. Since this RNA carries the protein code, it is called messenger RNA (mRNA). Transcription is highly regulated and is probably the major mechanism that distinguishes one kind of cell in an organism from another. That is, the set of genes being transcribed in one cell distinguishes the kinds of proteins made by that cell from the proteins made by another cell: a muscle cell makes contractile proteins, and a red blood cell makes globin proteins. This differential gene activity is often controlled by regulation of transcription of one set of genes versus another.

Schematic Diagram of DNA

(a) The double-stranded helical structure of DNA, as first presented by James Watson and Francis Crick in 1953. In the early 1950s James Watson and Francis Crick attempted to construct a model that would fit the information already known about DNA. By piecing together various data, they were able to show that the structure of DNA is a long, entwined double helix.

(b) Sugar-phosphate backbone with the variable sequence of bases attached. DNA is a macromolecule made up of four kinds of bases attached to a sugar-phosphate backbone. The bases carry genetic information, and the sugar and phosphate groups perform a structural role. Although the molecular building blocks that go into the assembly of DNA are few and relatively simple (consisting of only four types, deoxyadenosine, deoxycytidine, deoxyguanosine, and deoxythymidine, usually abbreviated as A, C, G, and T), the molecules are polymers, and their combination adds large-scale complexity to any attempt to determine general patterns of organization. A can pair only with T, and G only with C. Thus the order of bases along one strand determines the order along the other. The strands run in opposite directions.

(c) The chemical structure of a portion of one strand of DNA (drawing labels: 5’ END, DEOXYTHYMIDINE, DEOXYCYTIDINE, DEOXYADENOSINE, DEOXYGUANOSINE, 3’ END). In the repetitive sugar-phosphate sequence that forms the backbone of the molecule, each phosphate group is attached to the 5’ carbon (the fifth carbon in the ring) of one deoxyribose sugar, and to the 3’ carbon (the third carbon in the ring) of the deoxyribose sugar in the adjacent nucleotide. Thus the chain has a 5’ end and a 3’ end.

Regulation of Transcription

The transcription of an mRNA is a very precise activity. The mRNA molecule always starts at one precise nucleotide and ends at another exact point. How does the cell know when to transcribe a particular mRNA, and equally important, how are the start and stop points determined? The underlying assumption of all molecular geneticists working in the field of gene regulation is that DNA sequences, in addition to carrying the well-understood amino-acid coding information, also carry the transcription regulating information. What kind of information, besides the protein code, is embedded in the DNA? Among the kinds of signals presumed to exist are information telling the mRNA synthesis machinery that a gene is active; that the first nucleotide of the mRNA starts at a specific place; that the polymerization of the mRNA starting at that place proceeds linearly from only one of the two strands and moves in only one direction along the DNA; that the mRNA ends at a specific site; and that the mRNA synthesizing machinery should stop (or fall off the DNA) at that point.

Gene transcription is often highly modulated under the influence of complex factors in the environment of the cell (for example, hormones and metabolic substrates). In most cases, the regulatory signals that interact with the cellular and outside environment control transcription of the DNA. For the most part, those areas of DNA that control the rate and location of transcription have been found to lie upstream from the point of initiation of RNA transcription and upstream from the protein coding regions. In other words, these signal regions are themselves usually not transcribed into mRNA. A schematic drawing of a gene with some of these control landmarks is seen in Figure 1 (p. 1169).

The existence of these signal regions has been inferred from two major kinds of laboratory experiments. When mutant organisms are found that express a normal protein, but at abnormally low or high levels, direct structural analysis of the DNA has often revealed point mutations in upstream flanking sequences. The second kind of experiment involves the direct chemical modification of genes in the test tube using recombinant-DNA and genetic-engineering techniques applied to DNA fragments containing a gene.

Laboratory Modification of Genes

Modified genes can carry induced point mutations or, more commonly, precisely defined deletions of segments of DNA flanking a protein coding region. These modified DNA molecules can be reintroduced into cells, and their transcription and modulation studied. An example of such an experiment, showing the effect of a series of induced mutations, is given in Figure 2 (p. 1170), in which the hsp70 gene, a gene known to exist in the cells of many organisms including man, has been experimentally dissected by sequential deletions. The hsp70 gene has the peculiar property of not being active under normal circumstances, but is highly transcribed almost immediately in response to noxious stimuli such as raising the temperature of the cells from the normal 37° C of a mammalian organism to 42° C.

The Structure of a Protein

Proteins are built from amino acids. Shown here is the linear sequence of amino acids for a relatively small protein, the hormone insulin, one of the first proteins for which the primary structure was determined. The two chains are held together by disulfide bridges (S).

The Genetic Code

ala   GCA GCC GCG GCU
arg   AGA AGG CGA CGC CGG CGU
asn   AAC AAU
asp   GAC GAU
cys   UGC UGU
glu   GAA GAG
gln   CAA CAG
gly   GGA GGC GGG GGU
his   CAC CAU
ile   AUA AUC AUU
leu   CUA CUC CUG CUU UUA UUG
lys   AAA AAG
met   AUG
phe   UUC UUU
pro   CCA CCC CCG CCU
ser   AGC AGU UCA UCC UCG UCU
thr   ACA ACC ACG ACU
trp   UGG
tyr   UAC UAU
val   GUA GUC GUG GUU
stop  UAA UAG UGA

The 20 amino acids are coded by 61 triplet codons, with 3 additional codons serving as stop signals. The codon triplets shown here are the ones that would appear in the mRNA molecule.
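The genetic code table shown earlier lends itself to a direct lookup structure. The sketch below is an illustration, not code from the article: it builds the standard 64-codon table in mRNA letters, confirms the counts quoted in the text (61 sense codons, 3 stops, at most six synonyms per amino acid), and translates the first ten codons of the HUMONCEJ listing reproduced later in the article. The names `CODON_TABLE`, `transcribe`, and `translate` are assumptions, and one-letter amino-acid codes with "*" for stop are used purely to keep the table compact.

```python
from itertools import product

# Standard genetic code. Codons are generated with the first base varying
# slowest, each position cycling through U, C, A, G; AMINO lists the
# one-letter amino-acid code for each codon in that order ("*" = stop).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def transcribe(coding_strand):
    """Return the mRNA that reads like the DNA coding strand, with U for T."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna, phase=0):
    """Translate successive triplets from offset `phase`; trailing bases are ignored."""
    return "".join(CODON_TABLE[mrna[i:i + 3]] for i in range(phase, len(mrna) - 2, 3))


# Group codons by the amino acid (or stop signal) they encode.
synonyms = {}
for codon, aa in CODON_TABLE.items():
    synonyms.setdefault(aa, []).append(codon)

sense_codons = [c for c, aa in CODON_TABLE.items() if aa != "*"]
print(len(CODON_TABLE), len(sense_codons), sorted(synonyms["*"]))
# 64 61 ['UAA', 'UAG', 'UGA']
print(max(len(v) for aa, v in synonyms.items() if aa != "*"))   # 6

dna = "CCCGGGCCGCAGGCCCTTGAGGAGCGATGA"
print(translate(transcribe(dna)))   # PGPQALEER*
```

Passing `phase=1` or `phase=2` shifts the reading frame by one or two bases, which is how a single strand yields three different candidate translations.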

Major Processes in Protein Synthesis (drawing labels: RNA polymerase, RNA, translation, protein)

RNA Transcription (left): An enzyme, RNA polymerase, opens up the DNA, and the nucleotides are assembled into RNA using one strand of DNA as a template. The polymerization starts at a specific place, proceeds linearly in only one direction (3’ to 5’ direction) along the DNA, and ends at a specific site. RNA is similar to DNA except for two differences in its nucleotides and one difference in its structure. Instead of deoxyribose, RNA contains ribose; instead of thymine, RNA contains uracil (U); and lastly, RNA does not possess a regular helical structure and is usually single stranded. Adapted from Curtis, H. Biology. 4th ed. Worth, New York, 1983, p. 298.

FIGURE 1. A Schematic Drawing of a Gene (drawing labels: potential control, start of transcription, mRNA coding segment, protein start, protein coding segment, terminator)

FIGURE 2. The Effect of a Series of Induced Mutations on the hsp70 Gene (drawing labels: mRNA; 37°; 42°; Needed for expression; Place heat-shock segment in front of test gene)

(a) A gene under the control of a heat-responsive regulatory region can be introduced into mammalian cells. The gene enters the nuclear DNA and behaves normally. Sets of daughter cells (the new cells resulting from cell division) can be subjected to normal (37° C) or heat-shock temperature (42° C). When RNA from such cells is examined by electrophoresis on gel slabs, only the RNA from the heat-shocked cells should contain a heat-shock-related messenger RNA.

(b) Using this kind of mRNA assay, the gene bearing the heat-shock regulatory signals can be analyzed in structural detail. The DNA can be chemically modified by recombinant-DNA technology to delete small segments of the regulatory regions. Different modified molecules are then checked for their ability to engender a heat-shock mRNA response. By generating and testing a series of overlapping DNA modifications, the location of potential regulatory sequences can be inferred. A direct test would include the demonstration that the potential regulatory element confers a heat-shock response after it is inserted in front of an otherwise heat-shock nonresponsive gene.

By deleting DNA from one end of a large fragment, a point is reached at which the gene reintroduced into cells no longer responds to heat shock. As the deletions get closer and closer to the starting point of transcription, the ability of the gene to support the synthesis of any mRNA whatsoever is abruptly diminished. Simple experiments of this kind have allowed molecular geneticists to identify potential regulatory signals in the 5’ flanking region of many animal-cell genes.

Patterns in Promoters

There is enough similarity in the pattern of DNA sequences necessary for correct transcription of the hsp70 gene and other genes (as determined by direct experimental testing) to allow some general principles to emerge. First, the control region at the beginning of the gene, called the promoter, is made up of more than one DNA signal. Some of these signals, or closely related DNA sequences, are also present in most animal-cell genes studied to date. About 20-25 bases upstream from the transcription initiation site lies a TATA box, a short stretch of 6-8 bases with the sequence that gives it its name plus several additional As or Ts. Upstream a variable distance, there are often one or more CAAT boxes, also named for their resident sequence, CCAAT. Additional sites appear to be responsible for the tissue-specific mode of gene expression. For example, a highly specialized sequence upstream from the TATA box is required for the heat-shock response of the hsp70 gene. This can be directly demonstrated by chemically linking this upstream heat-shock sequence to a test gene that normally is unresponsive to heat stimulation. When such a chimeric gene is reintroduced into cells, it responds efficiently to heat.

A closer look at the relatively simple and short heat-shock promoter from the genes of a number of organisms has revealed a common DNA sequence pattern (see Table I). A general rule is that nucleotide sequences conserved in evolution among diverse species or present at homologous locations on many genes in the same organism represent functionally identical segments. Hence, the common pattern of heat-shock promoters observed among species as diverse as fruit flies, toads, and molds is likely to be a key element in the heat-shock regulatory machinery.

The most striking thing about the pattern (see Table I) is the presence each time (or almost each time) of a spatially conserved arrangement of bases. Furthermore, these bases exhibit another feature common to DNA regulatory signals, dyad symmetry. The sequence C--GAA--TTC--G on one strand reading left to right is identical to the sequence on the complementary strand reading right to left. In chemical terms this means that the two DNA strands have a region of stereochemical identity; that is, they are mirror images of each other (see Figure 3). Such segments with their two-fold axis of symmetry are often the site of binding for specific protein molecules that regulate the transcription of DNA in their vicinity. Exactly how such DNA binding proteins exert their influence is not understood, but models built by analogy with simpler bacterial systems where the data are more complete suggest that some binding proteins can interfere with the ability of the transcription machinery to polymerize RNA (repression), while others can augment that activity by boosting the polymerization machinery (activation), or even by displacing repressor molecules.

In the case of the heat-shock promoter, a protein molecule with the expected binding and activating properties has been isolated from animal cells. The nature of this protein, as well as some of the few other known regulatory proteins, is that it is made up of two identical subunits arrayed in a symmetrically opposing orientation. The most plausible model to account for this organization relates to the ability of such proteins to bind to DNA strands exhibiting dyad symmetry. The protein is capable of wrapping the stereochemical dyad in its symmetrical arms; hence the importance of recognizing dyad symmetries in DNA.

We have described but a few of the known types of DNA signals discovered to date. For the most part, these signals have been elucidated by a combination of experimental manipulation of the kind we have outlined and an empirical, manual analysis of the DNA sequences implicated. However, as the library of known DNA sequences has grown from only a few thousand nucleotide base pairs to over five million base pairs, manual methods of analysis have grown increasingly cumbersome. The last decade has seen a great deal of research into the development of computational tools that can be used to recognize significant features on DNA sequences.

TABLE I. Heat-Shock Element Sequences

Gene Fruit fly hsp70
Distance to TATA
ATAAAGAATAlTCIA6AA CTCEAfAAATTTCTCT66 TTCTCGTTfCTTCEAfAf CCCTCSAATfTTCfCfAA I III III I 6ACTffAATSTTCTfACC ATCTCCAATTTTCCCCTC I III III I ATCCAGAA6CCTCYAGAA CTCTASAAGTTTCTASAG TTCTA6AGACTTCCASTT I III III I CCCCAfAAACTTCCAC66 CCfAACAAAATTC6AfAE T6CCCfTATTTTCTAfAT I III III I CCCfAGAAGTTTCCTCTC I III III I TTCCffACTCTTCTAEAA I III III I
hsp68
hsp83
hsp22
hsp26 Toad hsp70
hsp30
Soybean hsp17
Mold DIRS-1
TATA
215 144 36 15
TATAAATACAEECSC
45 12
TATAAATACECCCCC
35 25 15
TATAAAACCAfACGC
147 46 26
TATAAATAECCACCS
97
TATAAATAACCfCAC
13
TATAAAAfCA6CfTC
CTCfAfAAAfCTCfC6AA CTCfCGAATCTTCCfCEA CTCfCGAAAGTTCTTCff CTCEfCAAACTTCfffTC I III III I TECCAEAAETTfCTACCA CTCffEAACfTCCCA6AA I III Ill I
204 194 139 72
TATAAATACAfCfEG
124 14
TATAAAACTCCT666
ATCCCfAAACTTCTAETT 6TCCAEAATCTTTCTEAA TTTCAEAAAATTCTAETT CCCAAEfACTTTCTCfAA I III III I
129 98 78 28
TTTAAATACCCCATf
TTTTACAATCTTCTAEAA TTCTACAACATTCCAACA I III III I C EAA TTC 6
179 169
TATAAATACTCECAC TATAAA

This table displays a number of nucleotide segments located in the putative control region of many heat-shock genes from four organisms. The heat-shock element, whose sequence was identified as being a control element by experiments of the type shown in Figure 2, is the fruit fly hsp70 segment closest to the TATA box. The other elements are all present one or more times near known heat-shock responsive genes, but it is not clear how many of these sequences function in the heat-shock response. Pattern analysis of this type leads to the identification of potential regulatory elements, but these must be examined by direct experimentation.

FIGURE 3. Stereochemical Images of Two Strands of DNA (the drawing pairs C--GAA--TTC--G on one strand with G--CTT--AAG--C on the complementary strand)

SEQUENCE-ANALYSIS TOOLS

DNA-Sequence Libraries

We have seen that the primary data of molecular biology are the DNA sequences themselves, the collection of As, Cs, Gs, and Ts that encode the information stored in genes. Determining that sequence from a given piece of DNA was a long and cumbersome process, taking on the order of man-years to elucidate a dozen or so base pairs. However, two new methodologies, known as Maxam-Gilbert and Coulson-Sanger sequencing, became practically useful in 1974. These two methods made possible the determination of 200-400 base pairs of DNA-sequence information in only a few hours. Scientists, using these sequencing methods and, often, specialized computer programs that aid in the assembly of these several hundred base-pair pieces into an entire genome, are now routinely sequencing complete viruses up to 50,000 base pairs in length.

Two major, authenticated, and annotated collections of DNA-sequence data are now maintained: GENBANK, an NIH-supported collaboration between Los Alamos National Laboratory and BBN, and the EMBL database, collected and distributed by the European Molecular Biology Laboratory. Each database contains over four million base pairs of sequence information (much of it complementary, as the two databases have begun active sharing of the sequence collection work load) in standardized, computer-readable form. Access to the data is through distribution of magnetic tapes and floppy disks, direct computer-to-computer and computer-to-terminal transfer over telephone lines, and computational resources such as BIONET, run by IntelliGenetics as an NIH-sponsored Research Resource, which provide access to both sequence data and sequence-analysis programs for the nation’s academic molecular biologists.

GENBANK is a trademark of Los Alamos National Laboratory and BBN.
BIONET is a trademark of IntelliGenetics.

Functionality

Although literally hundreds of computer programs have been written over the last decade to automate various tasks of DNA-sequence analysis, the types of common analyses are relatively few and well understood. (The reason for hundreds of programs relates to the lack of knowledge of existing work, diversity of machines and languages, and personal views on issues of pattern-matching algorithms and I/O behavior.) The major classes of functionality are the following:

• Translation and location of potential protein coding regions. This process takes the DNA sequence 3 base pairs at a time and produces the amino-acid sequence that would result from transcription of the DNA into RNA and from translation of the RNA into an amino-acid sequence. Normally this analysis must be done in six frames on a double-stranded DNA molecule, in each of three phases on each of the two strands of the DNA. A further sophistication of the analysis is to first look for the codon for the amino acid methionine, which initiates every protein, then determine if the nearest stop codon is at least 30 or 40 base pairs away, and only then translate the DNA. This eliminates many of the nonsense translations that result from a complete analysis.

• Inter- and intrasequence homology searching. This process compares a DNA sequence both to other DNA sequences and within itself to find repeated patterns of nucleotide bases. For example, locating precisely the same pattern of As, Gs, Cs, and Ts in front of 10 different genes would be highly significant evidence toward a possible control region. Finding identical sequences of DNA within the same gene is also a biologically “interesting” discovery. Two major complexities arise in homology searching. First, important homologies are not necessarily identical; often they differ by a few substitutions of one nucleotide for another or deletion of a few nucleotides. Second, in any string of several thousand letters in a four-letter alphabet, certain common patterns are bound to emerge. Evaluating the statistical rarity of these patterns (especially when they may be inexact matches) is essential to evaluation of significance.

• Inter- and intrasequence dyad symmetry searching. Dyad symmetries in a sequence indicate possible regions where control proteins may bind to a DNA molecule. In addition, it should be noted that DNA, in its double-stranded form, is quite rigidly locked into a double-helical shape. However, RNA is single-stranded and free to bend and twist in solution. Dyad symmetries in DNA point toward areas of inverse complementarity on the RNA; these are areas where portions of the RNA may “stick” to themselves and form structures called hairpin loops. Many biologists theorize that these structures also play a part in control of DNA replication, transcription, or translation. Like homologies, dyad symmetries may be partial. Statistical evaluation is important. In addition, when RNA loops are theorized, approximate free energy calculations of the hydrogen bonds holding the loops together may be made as a guide to the biological reality of the loops.

• Analysis of codon frequency, base composition, and dinucleotide frequency. Substantial redundancy exists in the genetic code. Different organisms or different genes within organisms seem to favor certain three-letter codons over others in nonrandom ways. Complete analysis of the distribution of these three-base patterns (in all three phases and both strands), as well as analysis of relative frequency of the four nucleotides taken individually and as pairs, is often requested by scientists searching for meaningful signals in DNA.

• Locating AT- or GC-rich regions. DNA molecules vary in the relative prevalence of AT and GC base pairs. Certain theories have been propounded about the meaning of “richness” of one over the other, and automated search for such regions (with the definition of richness under the control of the biologist) is considered a useful feature of sequence-analysis programs.

• Mapping of restriction enzyme sites. Crucial to the success of modern experimental molecular biology was the discovery of enzymes called restriction endonucleases, which were found to cut DNA selectively at patterns of four, five, or six bases (say the pattern ATTA or ATCGAT). Mapping of these sites on a molecule is considered automatic before manipulation of that molecule.

Some Illustrations of Sequence Analysis

The remainder of this section illustrates some common sequence-analysis operations. There are many programs used to perform these various tasks. We have chosen for our examples one of the most highly developed, the SEQ program of IntelliGenetics. Four DNA sequences from possible human oncogenes (genes that may cause cancer) are used to illustrate all but the final operation. HUMONCEJ and HUMONCT24 are the complete genes with control, coding, and unknown regions included; HUMONC1 and HUMONC2 are the coding regions only from the two genes. The restriction enzyme mapping operation is illustrated on a small, artificially constructed piece of DNA used to link larger molecules together (see Figures 4-9).
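Three of the operations listed under Functionality above (the six-frame search for open reading frames, dyad symmetry searching, and restriction-site mapping) can be sketched compactly. This is a hedged illustration only and is not how SEQ or any other named program works: every function name and parameter (`find_orfs`, `dyad_sites`, `map_sites`, `min_bp`, `max_mismatch`) is an assumption, input is assumed to be an uppercase string over A, C, G, T, and no statistical evaluation of the hits is attempted.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read in its own left-to-right direction."""
    return dna.translate(COMPLEMENT)[::-1]


def find_orfs(dna, min_bp=30):
    """Scan three phases on each strand for ATG followed by a distant in-frame stop.

    Returns (strand, phase, start, end) tuples. For strand "-" the
    coordinates refer to the reverse-complement string, not the input.
    """
    hits = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for phase in range(3):
            i = phase
            while i + 3 <= len(seq):
                if seq[i:i + 3] != "ATG":
                    i += 3
                    continue
                j = i + 3                      # look for the nearest in-frame stop
                while j + 3 <= len(seq) and seq[j:j + 3] not in STOP_CODONS:
                    j += 3
                if j + 3 > len(seq):
                    break                      # no stop codon left in this frame
                if j - i >= min_bp:
                    hits.append((strand, phase, i, j + 3))
                i = j + 3
    return hits


def dyad_sites(dna, width, max_mismatch=0):
    """Start positions of windows equal to their own reverse complement.

    Only the left half is compared with its mirror-image partners, so each
    mismatched pair counts once; the center base of an odd width is ignored.
    """
    hits = []
    for i in range(len(dna) - width + 1):
        window = dna[i:i + width]
        partner = reverse_complement(window)
        mismatches = sum(window[p] != partner[p] for p in range(width // 2))
        if mismatches <= max_mismatch:
            hits.append(i)
    return hits


def map_sites(dna, patterns):
    """Map each recognition pattern to the start positions where it occurs."""
    sites = {}
    for pattern in patterns:
        positions, start = [], dna.find(pattern)
        while start != -1:
            positions.append(start)
            start = dna.find(pattern, start + 1)   # step by 1 to keep overlaps
        sites[pattern] = positions
    return sites


test_gene = "CC" + "ATG" + "GCC" * 12 + "TAA" + "GG"
print(find_orfs(test_gene, min_bp=30))                    # [('+', 2, 2, 44)]
print(dyad_sites("ACGTGAATTCTA", width=6))                # [4]
print(map_sites("ATCGATTATTATCGAT", ["ATCGAT", "ATTA"]))
```

Raising `max_mismatch` is one simple way to admit the partial symmetries mentioned in the list, at the cost of more chance hits, which is why the text stresses statistical evaluation.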

MOLGEN: APPLICATIONS OF AI TO MOLECULAR GENETICS

Historical Background

The initial aim of the MOLGEN project was to provide intelligent assistance to the scientist designing an experiment in a biological laboratory. The major AI issue was planning, that is, the development of a design for a synthetic or analytic experiment before the experi-

SEQ is a trademark of IntelliGenetics. SEQ was originally written in the MOLGEN project at Stanford University and licensed to IntelliGenetics, where it has been significantly improved and enhanced over the last five years.

HUMONCEJ
The translated sequence is

Bases 1-54
DNA      CCC GGG CCG CAG GCC CTT GAG GAG CGA TGA CGG AAT ATA AGC TGG TGG TGG TGG
Phase 1  Pro Gly Pro Gln Ala Leu Glu Glu Arg  .  Arg Asn Ile Ser Trp Trp Trp Trp
Phase 2  Pro Gly Arg Arg Pro Leu Arg Ser Asp Asp Gly Ile  .  Ala Gly Gly Gly Gly
Phase 3  Arg Ala Ala Gly Pro  .  Gly Ala MET Thr Glu Tyr Lys Leu Val Val Val Gly

Bases 55-108
DNA      GCG CCG TCG GTG TGG GCA AGA GTG CGC TGA CCA TCC AGC TGA TCC AGA ACC ATT
Phase 1  Ala Pro Ser Val Trp Ala Arg Val Arg  .  Pro Ser Ser  .  Ser Arg Thr Ile
Phase 2  Arg Arg Arg Cys Gly Gln Glu Cys Ala Asp His Pro Ala Asp Pro Glu Pro Phe
Phase 3  Ala Val Gly Val Gly Lys Ser Ala Leu Thr Ile Gln Leu Ile Gln Asn His Phe

Bases 109-162
DNA      TTG TGG ACG AAT ACG ACC CCA CTA TAG AGG TGA GCC TGC GCC GCC GTC CAG GTG
Phase 1  Leu Trp Thr Asn Thr Thr Pro Leu  .  Arg  .  Ala Cys Ala Ala Val Gln Val
Phase 2  Cys Gly Arg Ile Arg Pro His Tyr Arg Gly Glu Pro Ala Pro Pro Ser Arg Cys
Phase 3  Val Asp Glu Tyr Asp Pro Thr Ile Glu Val Ser Leu Arg Arg Arg Pro Gly Ala

Bases 163-216
DNA      CCA GCA GCT GCT GCG GGC GAG CCC AGG ACA CAG CCA GGA TAG GGC TGG CTG CAG
Phase 1  Pro Ala Ala Ala Ala Gly Glu Pro Arg Thr Gln Pro Gly  .  Gly Trp Leu Gln
Phase 2  Gln Gln Leu Leu Arg Ala Ser Pro Gly His Ser Gln Asp Arg Ala Gly Cys Ser
Phase 3  Ser Ser Cys Cys Gly Arg Ala Gln Asp Thr Ala Arg Ile Gly Leu Ala Ala Ala

Here we have a translation in three phases of a section of one of the strands of a possible oncogene. Putative starting sites are indicated by “MET” and stop signals by “.”. Note the long open reading frame (possible coding region) that begins at position 27 and continues to the end of the sequence. The translation uses the three-letter code for amino acids; the SEQ program also allows translation in the alternative one-letter code.

FIGURE 4. Translation
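A three-phase translation of the kind shown in Figure 4 can be sketched as follows. This is a minimal illustration using the standard genetic code, not a description of how SEQ is implemented; the function name `translate` and the table layout are assumptions made for the example. The input is the first 54 bases of HUMONCEJ from the figure.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = dict(zip(
    (a + b + c for a in BASES for b in BASES for c in BASES), AMINO))

# One-letter to three-letter names; MET and "." follow the figure's notation.
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys", "Q": "Gln",
    "E": "Glu", "G": "Gly", "H": "His", "I": "Ile", "L": "Leu", "K": "Lys",
    "M": "MET", "F": "Phe", "P": "Pro", "S": "Ser", "T": "Thr", "W": "Trp",
    "Y": "Tyr", "V": "Val", "*": ".",
}


def translate(seq, phase):
    """Translate seq starting at 0-based offset phase (0, 1, or 2)."""
    seq = seq.upper()
    return [THREE_LETTER[CODON_TABLE[seq[i:i + 3]]]
            for i in range(phase, len(seq) - 2, 3)]


row1 = ("CCC GGG CCG CAG GCC CTT GAG GAG CGA TGA CGG AAT ATA AGC "
        "TGG TGG TGG TGG").replace(" ", "")
phases = [translate(row1, p) for p in range(3)]

# 1-based position of the first MET codon in the third phase.
met_index = phases[2].index("MET")
met_position = 2 + 3 * met_index + 1
```

In the third phase the first MET codon starts at base 27, which agrees with the start of the open reading frame noted in the caption.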

November 1985

Volume 28 Number 11

Communications of the ACM


Special Issue

HUMONC1 -- HUMONC2
The regions of homology in the two sequences are

Match ending at base 136 of HUMONC1 and base 139 of HUMONC2:
% = 88.652  P(141, 125) = .000E+00  E = .000

270 GGAGAGGAGGGGG-CAT-GAGGGGCATGAGAGGTACC 304
122 GGA-AGCAGGTGGTCATTGATGGG---GAGACGTGCC 154
% = 70.270  P(37, 26) = .195E-06  E = .062

275 GGAGGGGGCATGAG-GGGCATGA 296
 37 GGTGTGGGCAAGAGTGCGC-TGA  58
% = 73.913  P(23, 17) = .306E-04  E = 9.757

This is a comparison between the two theorized coding regions of the possible oncogenes. Dashes within the sequences indicate possible deletions; asterisks above the two sequences indicate mismatches. The printout also shows percent of total match, the a priori probability (the P value) of the match at any given location on the sequence, and the total number of expected matches (the E value) of that likelihood or greater for the entire comparison. The E value provides heuristic guidance as to the potential biological significance of the match. E values greater than 1 normally indicate an “uninteresting” homology.

FIGURE 5. Homology Search
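The percent-match figure reported with each alignment can be checked with a short sketch. The function below is hypothetical and covers only the percentage: it counts columns in which both rows carry the same base, treating a dash as a non-match, and divides by the alignment length. The P and E values are not reproduced here because the article does not give their formulas.

```python
def percent_match(top, bottom):
    """Return (matching columns, percent match) for two aligned rows."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    matches = sum(1 for a, b in zip(top, bottom) if a == b and a != "-")
    return matches, 100.0 * matches / len(top)


# Second alignment in Figure 5 (HUMONC1 270-304 against HUMONC2 122-154).
top = "GGAGAGGAGGGGG-CAT-GAGGGGCATGAGAGGTACC"
bottom = "GGA-AGCAGGTGGTCATTGATGGG---GAGACGTGCC"
matches, percent = percent_match(top, bottom)

# Third alignment in Figure 5 (HUMONC1 275-296 against HUMONC2 37-58).
matches3, percent3 = percent_match("GGAGGGGGCATGAG-GGGCATGA",
                                   "GGTGTGGGCAAGAGTGCGC-TGA")
```

For the two alignments above this gives 26 of 37 and 17 of 23 columns, matching the P(37, 26) and P(23, 17) entries and the printed percentages.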

ment was actually carried out. Two planning systems resulted: one based on a cognitive study of how humans design experiments, the other relying more on computer-science theoretic grounds on how to optimize human performance. Both systems showed substantial promise for further development as practical tools. The MOLGEN project also made progress in knowledge representation (how to effectively store diverse forms of biological knowledge in an intelligent computer system) and in knowledge acquisition (how to get that knowledge from human scientists in the first place). This research resulted in the development of a knowledge-engineering tool called the Unit System and formed the basis for KEE, a fully supported and documented commercial tool that is being used in current MOLGEN work (see below). A basic lesson learned during research in the HPP is that relatively simple reasoning or inference methods can be applied most effectively to problem solving when they are accompanied by a large amount of diverse expert knowledge. For this reason, projects in the HPP usually go through a life cycle of early emphasis on AI research, followed by a gradual transition to


the particular domain of real-world expertise involved in the project at hand. In 1982 we decided that MOLGEN had reached a state of development where the basic research goals in planning and design had been satisfied. Therefore, a search was begun for new problems of interest within the molecular biology domain. This search led to the second level of discovery problems discussed in this article, the task of scientific theory formation.

KEE is a trademark of IntelliCorp.

Research in Scientific Theory Formation

MOLGEN research in scientific discovery consists of several major phases. The first phase, which we have recently completed, was a detailed study of the reasoning processes used by Charles Yanofsky and colleagues during a 15-year endeavor to discover the regulatory processes operating in the tryptophan (trp) operon system of E. coli. The second phase, now under way, is to build a knowledge base that adequately represents the currently accepted theory of the biological system; adequacy here refers to the ability to explain laboratory observations and to predict logical consequences resulting from the theory. The third phase, now in the planning stage, will involve creating the reasoning


mechanisms that permit theory modification, extension, and occasionally complete re-creation, all of this still within the context of the Yanofsky trp-operon system. Finally, we hope to generalize and verify our results by applying the theory-formation system to a different biological system for which a verified regulatory theory is yet unknown.

Initial Problem Analysis

The first six months of the MOLGEN discovery project were spent in attempting to understand the scientific reasoning that went into building the theory of genetic regulation in the trp-operon system. This study involved both extensive literature research and personal interviews with most of the key scientists involved in the 15-year project. Yanofsky’s group has been uncommonly thorough in documenting both the methodology and the results of their project: our literature study involved the detailed analysis of over 50 scientific papers in refereed journals. Dozens of hours were spent questioning Yanofsky about speculation, insights, and blind alleys that occurred during the project. In addition, complementary and sometimes differing points of view were obtained from Imamoto, Hiraga, Lee, and Zurawski, among others.

To review the history of the trp-operon project very briefly, Yanofsky’s group started with the view that the genes that code for proteins in the tryptophan biosynthetic pathway (the proteins that are used by a cell to synthesize the amino acid tryptophan) were controlled by classic Jacob-Monod regulation, a model based on the regulatory mechanism for the lac operon of E. coli (Figure 10, p. 1178). A well-defined promoter and operator were also found preceding the genes in the E. coli trp operon, and a repressor protein was identified. However, over the course of studying the behavior of the system, various anomalies were discovered that could not be explained

HUMONCT24

The regions of dyad symmetry are

3414 CCGCCCCTGCCGGTCTCC-TGGCCTGCG 3440
     |||||| ||  :|||:|| | :||||||
3383 GGCGGG-ACCCTCAGGGGGAGTGGACGC 3357
% = 71.429  P(28, 20) = .939E-04  E = 1.647  G = -37.300

3480 GG-ACATGG-AGG-TGCCGGA-TGCAGG-AAGGAGG 3510
     || ||| || ||| |||  || :|||||  |||||:
3461 CCGTGTTCCCTCCGACG-ACTGGCGTCCGGTCCTCT 3427
% = 72.222  P(36, 26) = .101E-04  E = .178  G = -33.300

3487 GAGG-TGCCGGA-TGCAGG-AAGGAGG--TGCAG 3515
     |||| |||  || :|||||  |||||:   ||||
3452 CTCCGACG-ACTGGCGTCCGGTCCTCTGGCCGTC 3420
% = 70.588  P(34, 24) = .121E-04  E = .212  G = -30.037

3494 CGGATGCAGGAAGG-AGG-TGCAGACGGAAGG--AGGAG 3528
     |||  |||  |||| ||| ||| ||| | |||  |||||
3464 GCC-CCGT-GTTCCCTCCGACGACTGGCGTCCGGTCCTC 3428
% = 71.795  P(39, 28) = .423E-05  E = .074  G = -36.800

3614 TCCCAGGGAGGCTGTGC-ACAGACTGTCTTGAACATCC 3650
     |||| ||| |:||| ||  ||| | |:|:|   | |||
3587 AGGGCCCCACTGACCCGAGGTC-GTCGGGA-AGGAAGG 3552
% = 71.053  P(38, 27) = .587E-05  E = .103  G = -17.600

These are the regions of intrasequence homology in one of the complete possible oncogenes. Bars between the two sequences show potential base pairing by hydrogen bonds (A-T or G-C); a colon shows G-T pairing that is energetically very slightly favorable. In addition to the P and E values from the homology search, a G value shows a rough estimation of free energy contribution of the possible RNA hairpin loop. G values that are -14 or less indicate possibly stable structures.

FIGURE 6. Dyad Symmetry Search
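The pairing marks described in the caption can be generated column by column, as in the following sketch. It assumes, as in the figure, that the lower row is already written in the reverse direction so that potential partners line up vertically; the function name is hypothetical, and no attempt is made to estimate the G value.

```python
WATSON_CRICK = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "T"), ("T", "G")}


def pairing_line(top, bottom):
    """Mark each column: '|' for A-T or G-C, ':' for G-T, ' ' otherwise."""
    marks = []
    for a, b in zip(top, bottom):
        if (a, b) in WATSON_CRICK:
            marks.append("|")
        elif (a, b) in WOBBLE:
            marks.append(":")
        else:
            marks.append(" ")  # mismatch or gap
    return "".join(marks)


# First hairpin candidate in Figure 6 (bases 3414-3440 against 3383-3357).
top = "CCGCCCCTGCCGGTCTCC-TGGCCTGCG"
bottom = "GGCGGG-ACCCTCAGGGGGAGTGGACGC"
line = pairing_line(top, bottom)
```

For this pair the sketch finds 20 standard pairs over 28 columns, in line with the P(28, 20) entry, plus three G-T pairs.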


HUMONC2
The codon frequency phased from position 1

[Codon-frequency table: for each of the 64 codons, the table lists the codon, the amino acid it encodes (stop codons shown as “.”), the number of occurrences, and in parentheses the percentage of all codons.]

This is a codon-frequency table of the coding region of one of the possible oncogenes (shown for the phase believed to be the coding phase). Note that some of the distributions seem close to random (say the choice of codons for the amino acid Serine (Ser)), but some seem very nonrandom (look at Valine (Val) or Leucine (Leu)).

FIGURE 7. Codon-Frequency Determination
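A codon-frequency count for a single phase can be sketched as below. The function name and return format are assumptions made for illustration; the input is just the opening 30 bases of the coding region shown in Figure 5, not the full HUMONC2 sequence, so the numbers are not those of the figure.

```python
from collections import Counter


def codon_frequencies(seq, start=1):
    """Count codons read from 1-based position start.

    Returns {codon: (count, percent of all codons counted)}.
    """
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(start - 1, len(seq) - 2, 3)]
    counts = Counter(codons)
    total = len(codons)
    return {codon: (n, 100.0 * n / total) for codon, n in counts.items()}


# Opening 30 bases of the coding region (Figure 5), used only as a toy input.
coding_start = "ATGACGGAATATAAGCTGGTGGTGGTGGGC"
table = codon_frequencies(coding_start)
```

Calling the function with `start=2` or `start=3` gives the other two phases of the same strand.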

by even severe modification of the Jacob-Monod model. These experimental discoveries led to the proposing, testing, and evolution of a new regulatory model. Yanofsky concluded that transcription of the trp operon was regulated by a controlled termination site called an attenuator that is located between the operator and the gene for the first enzyme in the pathway. Attenuation, as the process is called, is a radically new regulatory mechanism in two major respects. First, secondary structure (hairpin loops) on the messenger RNA acts to control when transcription is stopped. Certain loops shape the RNA so that the polymerase “falls off.” Second, it involves a coordinated interaction between transcription and translation. The ribosome, previously thought to function only as the machinery used by a cell to produce an amino-acid sequence from an RNA template, also functions to “decide” which loops form in the RNA by where it “stalls” on the molecule. (See Figure 11, p. 1179, for more details.)

Our analysis of the Yanofsky group’s research had two major goals: to collect the knowledge of biological structures and techniques studied and employed during the research, and to elucidate something of the reasoning process used by scientists during the theory-formation process. We believe that the first goal has been accomplished; this will be verified as we complete our simulation knowledge-base building task described below. We have also made substantial progress in our study


of how humans perform the task of scientific theory formation. There is concrete evidence of at least four types of reasoning:

1. Data-driven reasoning. The process by which empirical evidence directly leads to extensions and modifications in a regulatory theory. This form of reasoning was especially evident in the early stages of the trp-operon project when Yanofsky began by considering Jacob-Monod regulation in the lac operon of E. coli.

2. Theory-driven reasoning. The process by which the growing theory itself suggests logical further extensions to the theory. For example, when it was conclusively shown that the leader region of the operon had a definite regulatory function, the logical conclusion was that a regulatory protein binding site had to exist in that region. This led to a great deal of experimental work to find the site.

3. Analogy to other biological systems. The process by which related work on other biosynthetic operons contributes to developing the regulatory theory. For example, parallel work on the histidine operon (another operon for the biosynthesis of amino acids in E. coli) in Kanai’s laboratory both confirmed parts of the Yanofsky attenuation theory and suggested extensions to that theory.

4. Analogy to distantly related systems. The process by which seemingly unrelated systems can help in the


theory-construction process. For example, some find it helpful to think of a DNA strand as a railroad track and DNA polymerase as a railroad engine moving down that track. Regulatory theories can be proposed by imagining ways in which the railroad engine can be stopped, diverted, or slowed down. Although this particular analogy sounds a bit frivolous when applied to regulatory genetics, it was interesting to note that Alexander Rich of MIT also found it useful when discussing control of gene transcription: “It’s as though the DNA is a railroad track and the polymerase zips along it until it finds a promoter . . .” (Z-DNA moves toward real biology. Science 222, 4623 (Nov. 4, 1983), 496).

We believe that it is significant that the various scientists involved in the trp-operon project made different uses of the reasoning methods just discussed. Yanofsky himself maintained a global view of the entire emerging system and tended to be very theory driven in his work. Other scientists (graduate students and postdoctoral scholars) tended to concentrate on much narrower parts of the system and were more data driven in their reasoning. The use of analogy was a very personal matter, both in choice of analogies and in the role they played. Our overall conclusions from this study of theory formation are that, first of all, the basic inference processes are indeed understandable and potentially simu-

HUMONCT24 LIMITS: 1 1669
The CG rich regions are

      1610       1620       1630       1640       1650
GGGCCCTCCT TGGCAGGTGG GGCAGGAGAC CCTGTAGGAG GACCCCGGGC
      1660
CGCAGGCCCC TGAGGAGCG

HUMONC2
The CG rich regions are

       510        520        530        540        550
GCTGCGGAAG CTGAACCCTC CTGATGAGAG TGGCCCCGGC TGCATGAGCT
       560        570
GCAAGTGTGT GCTCTCCTGA

This figure examines both the noncoding (from bases 1 to 1669 of HUMONCT24) and coding regions (HUMONC2) of a possible oncogene for CG richness. The biologist has defined richness to mean at least 75 percent CG composition for at least eight bases in a row. Note that the very beginning of the gene (bases 1 to 150) and the region immediately in front of the coding region (bases 1610 to 1669) are very heavily GC rich. The gene itself has only a few regions of moderate GC richness.

FIGURE 8. Location of AT- and GC-Rich Regions
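One way to read the richness definition in the caption (at least 75 percent C or G over at least eight consecutive bases) is as a sliding-window test, sketched below. This interpretation, the merging of overlapping windows, and the function name are assumptions made for illustration; SEQ may define and report regions differently.

```python
def gc_rich_regions(seq, window=8, threshold=0.75):
    """Return 1-based inclusive (start, end) spans covered by GC-rich windows."""
    seq = seq.upper()
    regions = []
    for i in range(len(seq) - window + 1):
        chunk = seq[i:i + window]
        gc_fraction = sum(1 for base in chunk if base in "GC") / window
        if gc_fraction >= threshold:
            start, end = i + 1, i + window
            if regions and start <= regions[-1][1] + 1:
                # Extend the previous region when windows overlap or touch.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions


# Toy input: eight G/C bases flanked by AT-only stretches.
toy_regions = gc_rich_regions("ATATATATGCGCGCGCATATATAT")
```

Because a window only needs six of its eight bases to be C or G, the reported span can extend a base or two into the flanking sequence; swapping "GC" for "AT" in the test gives the AT-rich search.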


Linear M13MP5-LINKER   Length = 78

 1 ATGACCATGATTACGAATTCCGGA 24
25 ATTCCGGAATTCCGGAATTCCGGA 48
49 ATTCCCCAAGCTTGGGAATTCCGGAATTCA 78

Enzyme    Positions marked on the map
NlaIII    10
HinfIII   14
EcoRI     16, 24, 32, 40, 48, 65, 73
EcoRI'    18, 26, 34, 42, 50, 67, 75
HpaII     21, 29, 37, 45, 70
HindIII   57
AluI      59

This figure shows the restriction enzyme map of a small artificial linker molecule used to connect two larger pieces of DNA. The enzyme symbols (EcoRI, for example) point to the place on the molecule where it would be cut by the enzyme.

FIGURE 9. Restriction Mapping
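Restriction-site mapping reduces to an exact pattern search, as the following sketch shows for five of the enzymes in Figure 9. The recognition sequences and cut points used are the standard ones for these enzymes (for example, EcoRI cuts G^AATTC). Reporting each site as the 1-based position of the first base after the cut is an assumption chosen because it matches the positions marked on the map; the dictionary and function names are hypothetical.

```python
# Recognition sequence and number of bases that precede the cut within it.
ENZYMES = {
    "EcoRI": ("GAATTC", 1),    # G^AATTC
    "HpaII": ("CCGG", 1),      # C^CGG
    "HindIII": ("AAGCTT", 1),  # A^AGCTT
    "AluI": ("AGCT", 2),       # AG^CT
    "NlaIII": ("CATG", 4),     # CATG^
}


def restriction_map(seq, enzymes=ENZYMES):
    """Return {enzyme: [1-based position of the first base after each cut]}."""
    seq = seq.upper()
    sites = {}
    for name, (pattern, offset) in enzymes.items():
        positions = []
        i = seq.find(pattern)
        while i != -1:
            positions.append(i + offset + 1)
            i = seq.find(pattern, i + 1)  # step by one to allow overlaps
        sites[name] = positions
    return sites


# The 78-base linker sequence from Figure 9.
linker = ("ATGACCATGATTACGAATTCCGGA"
          "ATTCCGGAATTCCGGAATTCCGGA"
          "ATTCCCCAAGCTTGGGAATTCCGGAATTCA")
site_map = restriction_map(linker)
```

Only the given strand is searched, which is sufficient here because all five recognition sequences are palindromic.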

[Diagram labels: PROMOTER; OPERATOR; GENES CODING FOR PROTEINS; CAP SITE; RNA POLYMERASE ENTRY SITE; REPRESSOR SITE; DIRECTION OF TRANSCRIPTION; RNA POLYMERASE; REPRESSOR; OPERATOR “OFF”]

The Jacob-Monod model states that transcription, the step in protein synthesis in which a messenger RNA is built from a DNA template, begins when an enzyme called polymerase binds to a specific spot in the front of the gene called a promoter. Another molecule called a repressor, which is made elsewhere in the cell and is activated by an excess of the end product of the gene (in the Jacob-Monod model, lac), can bind to a nearby site on the DNA called an operator. If it does so, it blocks the polymerase from binding to the promoter, thereby stopping the protein-producing machinery at the very first step.

FIGURE 10. The lac Operon of E. coli


[Diagram, two panels: A. High tryptophan level; B. Low tryptophan level. Labels: “Formation of this stem and loop results in the termination of transcription”; “Ribosome is stalled”]

When tryptophan is abundant (A), the leader region (segment 1) of the trp mRNA is fully translated. Segment 2 interacts with the ribosome, which enables segments 3 and 4 to base pair. This base-paired region somehow signals RNA polymerase to terminate transcription. In contrast, when tryptophan is scarce (B), segments 3 and 4 do not interact because the ribosome is stalled at the trp codons of segment 1. Segment 2 interacts with 3 instead of being drawn into the ribosome, and so segments 3 and 4 cannot pair. Consequently, transcription continues. (After D.L. Oxender, G. Zurawski, and C. Yanofsky. Proc. Natl. Acad. Sci. 76 (1979), 5524.) Adapted from Stryer, L. Biochemistry. 2nd ed. W.H. Freeman and Co., San Francisco, Calif., 1981, p. 677.

FIGURE 11. Model for Attenuation in the E. coli trp Operon

latable within a computational system, and, second, a problem-solving architecture that provides for multiple forms of reasoning, each contributing to an overall growing solution, is best for our initial consideration. The blackboard architecture has been developed by AI researchers for just this sort of situation, where contributions from multiple sources of knowledge are used jointly to solve a complex problem (Figure 12, p. 1180).

Overall System Architecture

In simplest form, the MOLGEN scientific discovery system can be thought of as containing two components: a performance element and a learning element. The performance element includes a computational model of how nature operates in the biological system being studied. In the case of the trp-operon work, it represents the current best view on the mechanism of genetic regulation. The model encompasses both structural knowledge (factual information about each of the entities important to regulation) and functional knowledge (how the various entities affect each other and the system as a whole). The structural and functional


knowledge taken together form the knowledge base of the trp-operon system. In addition, the performance element includes inference methods that allow the knowledge base to be used to simulate the activity of the trp-operon system in response to various internal and external conditions.

The learning element of the discovery system evaluates how well the performance element can be used to model the actual biological system of concern. It compares the result of simulations to laboratory data and decides when and how to modify the performance-element knowledge base (perhaps using the blackboard architecture previously discussed). A change to the knowledge base, in the form of an addition, deletion, or alteration to either the structural or functional knowledge, can be said to be a discovery.

The bulk of our work so far has concentrated on the performance element of the MOLGEN discovery system. The major computational problems we are attempting to resolve are in qualitative simulation (where the goal is to predict generic behavior and trends rather than numeric quantities) and in making sure that the


THE “PURE” BLACKBOARD PARADIGM

[Diagram: several knowledge sources, each connected to a central blackboard]

Blackboard
• The global database
• All problem-solving state data

Knowledge Sources (KS’s)
• Data-triggered agents
• Respond to changes in blackboard
• May include rules, procedures, tables, etc.
• Modify blackboard state
• May do I/O
• Both domain and control knowledge KS’s

Protocols
• KS’s modify only the blackboard
• Only KS’s modify the blackboard
• KS’s are “procedurally” independent

FIGURE 12. The “Pure” Blackboard Paradigm
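The protocol summarized in Figure 12 can be illustrated with a very small control loop. Everything in this sketch is hypothetical (the class, the function, and the two toy knowledge sources); it only shows the pattern in which independent, data-triggered knowledge sources communicate solely by reading and modifying a shared blackboard.

```python
class KnowledgeSource:
    """A data-triggered agent: fires when its trigger sees a relevant state."""

    def __init__(self, name, trigger, action):
        self.name = name
        self.trigger = trigger  # blackboard -> bool
        self.action = action    # blackboard -> dict of proposed entries


def run_blackboard(blackboard, sources, max_cycles=100):
    """Let knowledge sources update the blackboard until nothing changes."""
    for _ in range(max_cycles):
        changed = False
        for ks in sources:
            if ks.trigger(blackboard):
                updates = ks.action(blackboard)
                new = {k: v for k, v in updates.items()
                       if blackboard.get(k) != v}
                if new:
                    blackboard.update(new)
                    changed = True
        if not changed:
            break
    return blackboard


# Toy knowledge sources, loosely echoing data-driven and theory-driven steps.
data_ks = KnowledgeSource(
    "data-driven",
    lambda bb: "observation" in bb,
    lambda bb: {"hypothesis": "leader region is regulatory"})
theory_ks = KnowledgeSource(
    "theory-driven",
    lambda bb: "hypothesis" in bb,
    lambda bb: {"prediction": "binding site exists in leader region"})

board = run_blackboard({"observation": "leader deletion alters expression"},
                       [data_ks, theory_ks])
```

Neither knowledge source calls the other; the second one fires only because the first has written a hypothesis onto the blackboard.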

knowledge base is “transparent” enough so that both structural and functional information can be “understood” and modified by the learning element.

Knowledge-Base Construction

The trp-operon system knowledge base needs to contain at least the following classes of knowledge:

• knowledge about transcriptional and translational mechanisms and objects (operators, promoters, repressors, ribosomes, etc.);
• knowledge about the trp operon in particular (specific DNA structures in the operon, enzymes found in the system, etc.);
• knowledge about the laboratory methods used to discover information about the trp-operon system (nutrition experiments, mutation experiments, sequencing experiments, etc.);
• heuristics relating to regulation, biology in general, and several potential analog systems (including, for example, the railroad tracks analogy discussed earlier).


We have completed a substantial portion of the construction of the knowledge base. The work has been done using the KEE System, a product of IntelliCorp, on the Xerox 1108 Scientific Information Processor. KEE was selected as the best currently available frame-based, hierarchical, object-oriented knowledge acquisition and representation tool. Each of these three concepts will be discussed below. The Xerox 1108 is a high-performance scientific workstation with a large (over 2 MB) main memory, a microcoded processor optimized for the Interlisp programming language, and a double-width bit-mapped display.

Three major paradigms have emerged as representational frameworks for knowledge-based systems: logic based, rule based, and frame based. Logic-based systems acquire and store information as statements in the predicate calculus. They are especially useful when knowledge is easily formalizable, and relatively consistent and complete. Rule-based systems acquire and store knowledge as production (IF-THEN) rules. They are especially useful when the bulk of the knowledge is


heuristic and procedural. Frame-based systems acquire and store knowledge within blocks called frames, each of which represents an entity (either structural or functional) in the knowledge base. A given frame can have many different attributes (commonly called slots). Frame-based systems are especially useful when the knowledge is a mix of factual and procedural information, as is the case in our work in scientific discovery.

Frames are commonly organized into a hierarchy of classes, subclasses, and individual entities in the knowledge base. The major power of the hierarchy comes from the concept of inheritance. This means that attributes (slots) of a frame are passed down to its children, either directly in the form of values for that attribute or indirectly in the form of knowledge about what constitutes “legal” values for the attribute. For example, consider the small knowledge base shown below:

              ANIMALS
             /       \
          FISH       MAMMALS
         /    \      /      \
    SHARKS  TROUT  DOGS   ELEPHANTS
                    |
                  BOWSER

The ANIMALS frame has attributes like HEIGHT, WEIGHT, SKIN COVERING, and DOMESTICITY. None of these attributes are “filled in” with actual values (since the class of entities called ANIMALS does not have fixed values for those attributes), but certain restrictions on the values are known. For example, it may be defined that WEIGHT is a numerical quantity measured as a positive number of grams and that DOMESTICITY is either TAME or WILD. The MAMMALS unit inherits these attributes and might assign specific values, further narrow given restrictions (for example, by stating that all MAMMALS have WEIGHTS greater than 200 grams), or define new attributes common only to MAMMALS (say a CHILD-GESTATION-TIME attribute). This process continues through the hierarchy until an actual instance of an entity is defined, at which time all of the attributes are given specific values. So the BOWSER frame will have a DOMESTICITY of TAME, a WEIGHT of 15,050 grams, etc.

In real-world knowledge bases, inheritance becomes considerably more complex than the situation just described. Restrictions sometimes become only guidelines for values (e.g., a penguin is a bird except it cannot fly). Sometimes it is useful for slots to have several values up to some maximum number of values. Finally, the ability to manage a knowledge-base organization where a frame can have several parents adds a great deal of


expressive power since an entity can be viewed from several different perspectives in several different contexts. The following figures (Figures 13-15, pp. 1182-1185) will illustrate some of these features in the KEE trp-operon knowledge base.

Functional Knowledge

We previously noted that a knowledge base typically contains both factual and functional information, in other words, both “what” and “how” knowledge. KEE is an object-oriented knowledge representation system; this means that to the system user both types of knowledge are stored and retrieved in an identical manner: as the values of slots within units. For example, take the switch-state slot of a molecular switch. For a given molecular switch, the person constructing the knowledge base might specify that it was always “on.” But that would not be a very sensible or flexible way to define the property. It is much more interesting to define the switch state in terms of its functional relationship to the current molecular environment.

The functional knowledge can be specified either as a procedure in the LISP programming language or within an English-like rule language supplied with KEE. In addition, we are constructing a biology-specific process language that will be translated into the more general KEE rule language. Whatever the form of the functional knowledge, it can then be “attached” to a given slot. In the case of switch state, a set of rules relating switch state to cofactor state, switch structural integrity, and the like forms this attached piece of functional knowledge. When the user asks KEE to retrieve the switch state of a particular molecular switch, if that value is not explicitly resident in the switch-state slot, the system notes the attached procedure and attempts to determine the value of switch state given the current state of the knowledge base. A full discussion of functional knowledge is beyond the scope of this overview article.
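Slot inheritance, multiple parents, and attached procedures can be illustrated with a small sketch. This is not KEE's interface: the `Frame` class, the PETS frame, and the WEIGHT_KG slot are hypothetical, introduced only for the example, while ANIMALS, MAMMALS, DOGS, BOWSER, DOMESTICITY, and WEIGHT follow the small knowledge base described earlier. A callable stored in a slot stands in for an attached procedure that computes a value on demand.

```python
class Frame:
    """A unit with named slots and any number of parent frames."""

    def __init__(self, name, parents=(), **slots):
        self.name = name
        self.parents = list(parents)
        self.slots = dict(slots)

    def get(self, slot, asker=None):
        """Return a slot value, searching this frame and then its parents.

        A callable slot value acts as an attached procedure and is run on
        the frame that was originally asked.
        """
        asker = asker or self
        if slot in self.slots:
            value = self.slots[slot]
            return value(asker) if callable(value) else value
        for parent in self.parents:
            try:
                return parent.get(slot, asker)
            except KeyError:
                continue
        raise KeyError(f"{asker.name} has no value for {slot}")


animals = Frame("ANIMALS")
mammals = Frame(
    "MAMMALS", [animals],
    # Hypothetical attached procedure: derive kilograms from WEIGHT (grams).
    WEIGHT_KG=lambda frame: frame.get("WEIGHT") / 1000)
pets = Frame("PETS", DOMESTICITY="TAME")   # hypothetical second parent
dogs = Frame("DOGS", [mammals, pets])      # a frame with two parents
bowser = Frame("BOWSER", [dogs], WEIGHT=15050)
```

Asking BOWSER for DOMESTICITY finds the value through the second parent of DOGS, and asking for WEIGHT_KG runs the procedure inherited from MAMMALS against BOWSER's own WEIGHT.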
We suggest that interested readers peruse the Fikes and Kehler article referenced in the Further Readings section.

Simulation of the trp-Operon System

Simulation of the trp-operon system is a simple process once the knowledge base is complete and accurate. Two different inference methods, forward and backward chaining, are employed. In forward chaining, a fact is stated (say, that an experiment has found a destructive mutation in the promoter region of the operon), and the system attempts to determine the ramifications of that fact. It does so by finding all rules or other procedural knowledge that include the state of the promoter in the conditional parts of their reasoning. That knowledge is applied, possibly resulting in further changes to the knowledge base. The system recurs with these new facts driving forward inference. The cycle continues until no new implications can be found. The state of the knowledge base after this process represents the system’s prediction of the effects of a destroyed


promoter region. Forward chaining can be viewed as “data-directed” reasoning.

Backward chaining occurs when a question like “What could cause unregulated RNA synthesis to occur?” is asked. The system looks at the conclusion portions of all functional knowledge to find where that inference could possibly be made. If it finds any, then it checks the conditional portions of the knowledge to see if it is already true. For conditionals whose value is not yet known, it asks what would make those values true

[Diagram: part of the unit hierarchy of the trp-operon knowledge base, with general classes such as MOLECULAR.SWITCHES, COFACTORS, and SPACERS on the left and specific instances such as TRP-R.POLYPEPTIDE.1 and TRP.SPACER.1 on the right]

Each word in this diagram represents a single frame (also called a unit in KEE) in the tryptophan-operon knowledge base. The most general units are on the left of the diagram; the most specific ones on the right. A solid line connecting two units, for example, from PROTEINS to ENZYMES, indicates a class-subclass relationship. A dotted line indicates a prototype-instance relationship; for example, TRP-R.1 is a specific model of the trp repressor created during a single simulation run. Note that units can have more than one parent in the knowledge base; this is very important as it allows individual entities to be viewed in several different perspectives. For example, OPERATORS are both DNA.SEGMENTS (viewing them in their structural sense as pieces of DNA) and MOLECULAR.SWITCHES (viewing them in their functional sense as control sites that can be used to turn a process on or off).

FIGURE 13. The Current Hierarchy of Part of the trp-Operon Knowledge Base

and recurs. This cycle continues until all necessary conditions have been determined or it has been shown that the proposed state (in this case unregulated RNA synthesis) could never occur in the current model. Backward chaining can be viewed as “theory-directed” reasoning.
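Forward and backward chaining as described for the simulator can be sketched over a handful of IF-THEN rules. The rules below are deliberately simplified, hypothetical statements written for this example, not entries from the MOLGEN knowledge base, and the backward chainer assumes the rule set contains no cycles.

```python
# Each rule: (set of conditions, conclusion). Hypothetical, simplified rules.
RULES = [
    ({"promoter destroyed"}, "polymerase cannot bind"),
    ({"polymerase cannot bind"}, "no transcription"),
    ({"no transcription"}, "no tryptophan enzymes"),
    ({"repressor destroyed"}, "unregulated RNA synthesis"),
    ({"operator destroyed"}, "unregulated RNA synthesis"),
]


def forward_chain(facts, rules):
    """Data-directed: apply rules until no new conclusions appear."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts


def backward_chain(goal, rules, known=frozenset()):
    """Theory-directed: list the sets of unproven facts that would yield goal."""
    if goal in known:
        return [set()]
    supporting = [rule for rule in rules if rule[1] == goal]
    if not supporting:
        return [{goal}]  # nothing concludes it, so it must be assumed
    explanations = []
    for conditions, _ in supporting:
        partial = [set()]
        for condition in conditions:
            partial = [p | q for p in partial
                       for q in backward_chain(condition, rules, known)]
        explanations.extend(partial)
    return explanations


predicted = forward_chain({"promoter destroyed"}, RULES)
causes = backward_chain("unregulated RNA synthesis", RULES)
```

Starting from a destroyed promoter, forward chaining reaches the conclusion that no tryptophan enzymes are made; asking what could cause unregulated RNA synthesis returns the two alternative explanations encoded in the toy rules.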

Further Research

The preceding has described our efforts to construct the performance element of our discovery system. When


we are satisfied with its ability to accurately model Jacob-Monod repression in the trp operon, we will proceed with development of the learning element of the system. To accomplish this, we intend to pursue the following specific tasks:

1. Build an interface that allows the simulator to be used to explain observations that are indeed explainable without changes to the current model. For example, “I have observed a mutation that causes constitutive (uncontrolled) production of tryptophan. How can that be explained within the Jacob-Monod model?”

2. Begin to recognize when observations are “interesting.” Interesting here has one of the following broad meanings:

   • A seemingly direct contradiction to the existing theory.
   • A statistically rare occurrence (one that is understandable by the current theory, but one that should not occur very often).

Unit: MOLECULAR.SWITCHES in knowledge base TRP-SIM.1
Created by KARP on 3-May-85 16:12:46
Modified by FRIEDLAND on 27-Jun-85 14:19:44
Superclasses: FUNCTIONAL.ENTITIES
Subclasses: REGULONS, PROMOTERS, GENES, OPERATORS, REPRESSORS
Member of: (CLASSES in KB GENERICUNITS)

MemberSlot: COFACTOR from MOLECULAR.SWITCHES
    Inheritance: OVERRIDE.VALUES
    Values: Unknown

MemberSlot: COFACTOR.POLARITY from MOLECULAR.SWITCHES
    Inheritance: OVERRIDE.VALUES
    ValueClass: (ONE.OF ACTIVATE DEACTIVATE)
    Values: Unknown

MemberSlot: COFACTOR.SITES from MOLECULAR.SWITCHES
    Inheritance: OVERRIDE.VALUES
    Values: Unknown

MemberSlot: SWITCH.STATE from MOLECULAR.SWITCHES
    Inheritance: OVERRIDE.VALUES
    ValueClass: (ONE.OF ON OFF)
    Cardinality.Min: 1
    Cardinality.Max: 1
    Values: Unknown

This display of the MOLECULAR.SWITCHES unit first provides some information about the unit itself: its creator, last modifier, parent (superclass) in the hierarchy, and direct children (subclasses). Then a few of the slots, which provide knowledge about the concept of molecular switches, are shown. This distinction between knowledge about a unit (sometimes called bookkeeping knowledge) and knowledge about a concept or structure represented by the unit is sometimes confusing, but always important. Each slot has a name, the name of the unit in the hierarchy in which the slot is either defined or last changed, an inheritance mode, and a value. In addition, a slot may have various kinds of restrictions on its possible values, for example, a list of possible values, a numerical range, or a specification that a slot must be “filled” with the name of another unit in the knowledge base. All of the slots shown in this figure were originally defined in this unit and have an inheritance mode of OVERRIDE.VALUES (this means that children of this unit will inherit any values defined here unless they choose to provide specific, more narrowly specified values that still fit the slot’s restrictions). The COFACTOR.POLARITY slot has been given the restriction that a legal value must be either ACTIVATE or DEACTIVATE (or both, since in some rare cases a molecular switch can be activated or deactivated by the same cofactor). The SWITCH.STATE slot is restricted to precisely one value, ON or OFF (that is the purpose of the “cardinality” specification).

FIGURE 14. A Portion of the MOLECULAR.SWITCHES Unit
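The slot restrictions described for Figure 14 can be sketched in a few lines of Python. This is an illustrative model only, not KEE's actual slot machinery: the `Slot` class, its field names, and the choice to represent an Unknown value as an empty tuple are all assumptions made for the example. Only the slot names, value classes, and cardinalities are taken from the figure.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass
class Slot:
    """Hypothetical stand-in for a KEE member slot with value restrictions."""
    name: str
    defined_in: str                          # unit where the slot was defined
    inheritance: str = "OVERRIDE.VALUES"
    one_of: Optional[FrozenSet[str]] = None  # ValueClass (ONE.OF ...), if any
    card_min: int = 0                        # Cardinality.Min
    card_max: Optional[int] = None           # Cardinality.Max (None = no limit)
    values: Tuple[str, ...] = ()             # empty tuple stands for Unknown

    def set_values(self, *values: str) -> None:
        """Store values only if they satisfy every restriction on the slot."""
        if self.one_of is not None:
            illegal = [v for v in values if v not in self.one_of]
            if illegal:
                raise ValueError(f"{self.name}: illegal value(s) {illegal}")
        too_few = len(values) < self.card_min
        too_many = self.card_max is not None and len(values) > self.card_max
        if too_few or too_many:
            raise ValueError(f"{self.name}: wrong number of values")
        self.values = tuple(values)


# COFACTOR.POLARITY may hold ACTIVATE, DEACTIVATE, or both.
cofactor_polarity = Slot(
    "COFACTOR.POLARITY", "MOLECULAR.SWITCHES",
    one_of=frozenset({"ACTIVATE", "DEACTIVATE"}),
)

# SWITCH.STATE must hold exactly one of ON or OFF.
switch_state = Slot(
    "SWITCH.STATE", "MOLECULAR.SWITCHES",
    one_of=frozenset({"ON", "OFF"}), card_min=1, card_max=1,
)
```

Under these assumptions, `switch_state.set_values("ON", "OFF")` raises `ValueError` because of the cardinality limit, while `cofactor_polarity.set_values("ACTIVATE", "DEACTIVATE")` is accepted, matching the rare case the caption mentions.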


Unit: REPRESSORS in knowledge base TRP-SIM.1
Created by KUUND on 22-Apr-85 15:49
Modified by KARP on 10-May-85 14:18:58
Superclasses: MOLECULAR.SWITCHES, PROTEINS
Subclasses: TRP-R
Member of: (CLASSES in KB GENERICUNITS)

MemberSlot: ACTIVE.SITES from REPRESSORS
    Inheritance: OVERRIDE.VALUES
    Values: Unknown

MemberSlot: COFACTOR from MOLECULAR.SWITCHES
    Inheritance: OVERRIDE.VALUES
    Values: Unknown

MemberSlot: COFACTOR.POLARITY from REPRESSORS
    Inheritance: OVERRIDE.VALUES
    ValueClass: (ONE.OF ACTIVATE DEACTIVATE)
    Values: ACTIVATE

MemberSlot: COFACTOR.SITES from MOLECULAR.SWITCHES
    Inheritance: OVERRIDE.VALUES
    Values: Unknown

MemberSlot: COMPONENT.POLYPEPTIDE.RATIOS from PROTEINS
    Inheritance: OVERRIDE.VALUES
    ValueClass: LIST
    Values: Unknown

MemberSlot: COMPONENT.POLYPEPTIDES from PROTEINS
    Inheritance: OVERRIDE.VALUES
    Values: Unknown

MemberSlot: COMPONENTS from PROTEINS
    Inheritance: OVERRIDE.VALUES
    ValueClass: (ONE.OF TRP csi32)
    Values: Unknown

MemberSlot: CONCENTRATION from PHYSICAL.ENTITIES
    Inheritance: OVERRIDE
    ValueClass: (ONE.OF HIGH LOW)
    Cardinality.Min: 1
    Cardinality.Max: 1
    Values: Unknown

MemberSlot: CONSTRUCTION.METHOD from PHYSICAL.ENTITIES
    Inheritance: OVERRIDE
    ValueClass: (ONE.OF AGGREGATE PRIMITIVE POLYMER)
    Cardinality.Min: 1
    Cardinality.Max: 1
    Comment: "Describes how a member object decomposes"
    Values: Unknown

MemberSlot: MOLECULAR.WEIGHT from PHYSICAL.ENTITIES
    Inheritance: OVERRIDE
    ValueClass: INTEGER
    Cardinality.Min: 1
    Cardinality.Max: 1
    Values: Unknown

• An observation that the current model cannot predict, because the model is either not detailed enough or incomplete. The observation in this case must correlate with the model, either because an important object of the model is involved or because it relates to an effect predicted by the model.

3. Build a mechanism for postulating extensions or corrections to the current theory: a constrained regulatory theory generator. The overall approach to this mechanism is perhaps the most interesting problem in our work. In discussions with other computer scientists, two different notions of reasoning have evolved: “or” reasoning, where the theory construction process consists of hierarchical refinement of abstract ideas into more detailed ones, and “and” reasoning, where the theory is built up in little pieces at many different levels simultaneously. We see strong evidence for both types of reasoning within Yanofsky’s project. In fact, as stated above, the global model of Yanofsky’s laboratory is a hybrid one. Individual graduate students performed “and” tasks, filling in details of seemingly unrelated pieces of the model. Yanofsky was the master “or” reasoner, slowly building a hierarchical model of the new regulatory mechanism.
4. Build a mechanism for evaluating alternative theories. This would include rating the theories based on plausibility, selectability, completeness, significance, and so on. We hope that the evaluation process produces information that will be useful in discriminating among the possible theories.
5. Test the entire structure on the evolving trp-operon regulatory system.
6. Experiment with different initial knowledge bases to see how the discovery process is altered by the availability of new techniques, analogous systems, etc.

Potential Utility and Generality of the MOLGEN Discovery System
As the preceding example shows, our view of the MOLGEN discovery system is as an “intelligence amplifier” for the biological scientist, designed not to replace human reasoning, but to augment it. We believe that such a system can assist the scientist in at least the following ways:

• Thoroughness. It is very difficult for a biologist to have any real confidence that all major ramifications of a proposed addition or modification to a scientific theory have been explored.
• Lack of prejudice. Scientists sometimes remain in a particular theoretical framework of reasoning for too long. A computational system, although maintaining its own “prejudices” in the form of its driving heuristics, can at least ensure that many different viewpoints are represented.
• Coordination of reasoning. It is difficult for humans to focus simultaneously on two or more different subproblems within a single, more complex problem. In the trp-operon example, this was seen in the need to coordinate the role of the ribosome binding site in the leader region with the secondary structures at the end of the leader region. A compound problem-solving architecture like a blackboard system helps to ameliorate this difficulty.

FIGURE 15. The REPRESSORS Unit
Here we see that REPRESSORS has inherited slots from MOLECULAR.SWITCHES, PROTEINS, and PHYSICAL.ENTITIES, has explicitly stated that the value of the COFACTOR.POLARITY slot first defined above in the MOLECULAR.SWITCHES unit is ACTIVATE, and has defined a new slot called ACTIVE.SITES. So, when the trp-operon discovery system wishes to consider repressors as types of proteins, it need only worry about properties like polypeptide ratios; when it is reasoning about repressors as molecular switches for regulation, it is concerned with properties like cofactors and switch states.
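The way the REPRESSORS unit of Figure 15 collects slots from several parents, and overrides one inherited value, can be sketched as follows. This is a simplified illustration, not KEE's inheritance algorithm: the `Unit` class, its breadth-first lookup order, and the assumption that PROTEINS is a direct child of PHYSICAL.ENTITIES are choices made for the example. `None` stands for an Unknown value.

```python
from collections import deque
from typing import Dict, Iterable, Iterator, Optional, Set


class Unit:
    """Hypothetical frame with multiple parents and OVERRIDE-style lookup."""

    def __init__(self, name: str, parents: Iterable["Unit"] = (),
                 slots: Optional[Dict[str, Optional[str]]] = None) -> None:
        self.name = name
        self.parents = list(parents)
        self.slots = dict(slots or {})  # slot name -> value (None = Unknown)

    def lineage(self) -> Iterator["Unit"]:
        """Yield this unit, then its ancestors breadth-first, each once."""
        seen: Set[int] = set()
        queue = deque([self])
        while queue:
            unit = queue.popleft()
            if id(unit) in seen:
                continue
            seen.add(id(unit))
            yield unit
            queue.extend(unit.parents)

    def is_a(self, other: "Unit") -> bool:
        """True if `other` is this unit or one of its ancestors."""
        return any(unit is other for unit in self.lineage())

    def slot_names(self) -> Set[str]:
        """All slots visible from this unit, local or inherited."""
        return {name for unit in self.lineage() for name in unit.slots}

    def get(self, slot: str) -> Optional[str]:
        """Nearest known value: a unit's own value overrides its parents'."""
        for unit in self.lineage():
            value = unit.slots.get(slot)
            if value is not None:
                return value
        return None


physical = Unit("PHYSICAL.ENTITIES",
                slots={"CONCENTRATION": None, "MOLECULAR.WEIGHT": None})
proteins = Unit("PROTEINS", [physical],
                slots={"COMPONENT.POLYPEPTIDE.RATIOS": None})
functional = Unit("FUNCTIONAL.ENTITIES")
switches = Unit("MOLECULAR.SWITCHES", [functional],
                slots={"COFACTOR": None, "COFACTOR.POLARITY": None,
                       "SWITCH.STATE": None})
repressors = Unit("REPRESSORS", [switches, proteins],
                  slots={"ACTIVE.SITES": None, "COFACTOR.POLARITY": "ACTIVATE"})
trp_r = Unit("TRP-R", [repressors])
```

In this sketch `trp_r.get("COFACTOR.POLARITY")` returns `"ACTIVATE"` through REPRESSORS, while the same slot on `switches` is still Unknown, and `trp_r` can be treated either as a protein or as a molecular switch depending on which perspective a reasoning step needs.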

It is clear that much research needs to be done before we can claim to have even a rudimentary scientific theory discovery system within the narrow domain of regulatory genetics. However, we believe the framework of qualitative simulation combined with reasoning about the structural and functional knowledge used to model the system being simulated is generalizable to a wide variety of domains. Once the underlying theories and models of a domain are represented as a knowledge base of the form described in this article, the same reasoning methods we use in regulatory genetics should be applicable.

Acknowledgments. We wish to thank our many colleagues who have contributed to the work described in this article. Particular thanks to Peter Karp for providing the scenario of a discovery program and to IntelliGenetics for providing illustrations of the SEQ program.

FURTHER READINGS
James Watson’s classic book, Molecular Biology of the Gene, from W. A. Benjamin (New York, 1970), is a comprehensive overview of molecular genetics. The most comprehensive descriptions of computer programs for DNA sequence analysis may be found in special issues of the journal Nucleic Acids Research published in January 1982 (Volume 10, Number 1) and 1984 (Volume 12, Number 1) and available as collections entitled The Applications of Computers to Research on Nucleic Acids from IRL Press (Oxford, England). An excellent review of the Yanofsky work on attenuation appears in “Attenuation in the Control of Expression of Bacterial Operons,” Nature, Volume 289, February 26, 1981, pages 751-758. The original MOLGEN project is described in “The Concept and Implementation of Skeletal Plans” by Peter Friedland and Yumi Iwasaki in the Journal of Automated Reasoning, Volume 1, Number 2, 1985, pages 161-208.


A good text on methods for machine learning and discovery is Machine Learning: An Artificial Intelligence Approach by Michalski, Carbonell, and Mitchell from Tioga Publishing Company, Palo Alto, California. Three articles that discuss the basic methods for knowledge representation (including a detailed description of KEE) may be found in the September 1985 issue of CACM in the section entitled “Special Section on Architectures for Knowledge-Based Systems” (Volume 28, Number 9, pages 902-941).

CR Categories and Subject Descriptors: I.2.1 [Artificial Intelligence]: Applications and Expert Systems--medicine and science; I.2.4 [Artificial Intelligence]: Knowledge Representation Formalisms and Methods; I.2.6 [Artificial Intelligence]: Learning; I.2.8 [Artificial Intelligence]: Problem Solving, Control Methods and Search; I.6 [Simulation and Modeling]; J.3 [Life and Medical Sciences]: biology
General Terms: Design, Theory
Additional Key Words and Phrases: AI applications to molecular biology, discovery, theory formation

Authors’ Present Addresses: Peter Friedland, Dept. of Computer Science, Knowledge Systems Laboratory, Stanford University, 701 Welch Road, Palo Alto, CA 94304; Laurence H. Kedes, Dept. of Medicine, Veterans Administration Medical Center, Stanford University, 3801 Miranda Avenue, Palo Alto, CA 94304.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

REPRESENTATIVE PROBLEM SCENARIO
Since our work is still in the early stages, the following scenario is clearly speculative. However, it illustrates (in a translated-to-English form) the kinds of reasoning we believe a theory-formation system should be able to perform. We have also attempted to indicate the kinds of knowledge used to reason at each stage. (Human input is indicated by the “<” symbol, system output by “>,” and comments are enclosed in square brackets.)

< Consider a normal E. coli growing in a trp-poor medium.

> Understood: Trp operon is activated. Synthesis of biosynthetic enzymes occurs to promote growth.

[The system uses its description of a functional trp operon to initialize and execute a simulation of the described experiment. It then displays a summary of the results of that simulation.]

< Correct. Now consider a similar situation, but with a mutation in the promoter region of the trp operon, and no growth occurs.

> Understood: Transcription is disabled, preventing cell growth because of trp starvation.

[The system understands the technique we have mentioned, knowing it can produce a certain range of effects, e.g., lack of functionality in a given region of DNA. The system propagates this effect through its simulation and again summarizes the result.]

< How would you explain the above cells growing normally?

> We can assume that the mutation you mention has no deleterious effect.

[Above, the system realized the mutation technique could produce several different types of mutants, but only told us about the one it presumed we would be most interested in. Here it is asked to account for behavior different from that which it predicted, and can do so by examining these other mutants.]

< Now consider a mutation somewhere in the leader region that causes constitutive synthesis. How would you explain that?

[Constitutive synthesis means constant or unregulated production of the end product of the genome.]

> The repression mechanism has been disabled in some way. Most likely it is that the operator no longer allows repressor binding.

[Above we discussed a form of simulation-based forward reasoning that the system might use. Here we note that backward reasoning may be appropriate as well. The system starts with a general principle (the disabling of repression), attempts to figure out what effects might have caused this, and correlates these effects with elements of the experiment description.]

< Now we observe the following: A mutation is introduced in the leader region. Trp enzyme production is still regulated by trp in the medium; however, the trp-production rate in a trp-poor environment is significantly higher than normal. These mutants fail to recombine with trp-E.

> I do not understand this. The mutation seems to map into the region between the operator and the first structural gene. This region has no known regulatory function, and hence a mutation here should not affect trp production, although it is possible that this mutation affects transcription or translation speed. Thus I postulate this piece of DNA has some function, possibly a positive effect that the mutation enhances, but probably a negative effect that we have disabled.

[We are now into a discovery mode of proposing explanations to a phenomenon that requires some changes in our model of the regulation of trp. The system knows that a mutation could either have deleted a functional site of which it was currently unaware, or added some new functionality. Its biological heuristics tell it the former is considerably more likely.]

< What types of negative effects might we be disabling?

> Perhaps there was translation or transcription termination or pausing at the mutated site; or perhaps this site changed the elongation rate of polymerase or ribosomes; or perhaps it produces a repressor protein; or perhaps the mutations introduced a promoter into the leader region, although this is not likely because the promoter would have to be regulated by trp.

[These are speculations on additional structures and functions in the leader region of the molecule. They come from a constrained theory generator. The obvious next step is to rank these theories in order of plausibility and then suggest experiments to differentiate among the possibilities.]
