SPECIAL ISSUE
DISCOVERING THE SECRETS OF DNA
Sophisticated software tools are becoming increasingly important in helping biologists understand how nature operates. Symbolic pattern-recognition and artificial-intelligence methodologies are contributing to the development of such software.
PETER FRIEDLAND and LAURENCE H. KEDES
The fundamental quest of biological science is to understand how nature functions. The discovery of the genetic code (the language a living organism uses to produce proteins from a nucleic-acid template) was a major step toward understanding a complex and intriguing biological process. However, this important discovery was only the beginning of the exploration. In particular, the problem of control (how cells “decide” which of the many thousands of genes should be “switched on” at any given time) is central to further understanding living organisms.
Discovery occurs at two levels. The first is the primitive level, which, in the case of deoxyribonucleic acid (DNA), is the level of the four nucleotide bases that are attached to the sugar-phosphate backbone. The problem is to locate significant patterns of bases that provide control signals to the cellular protein-production machinery, a symbolic pattern-recognition task for which computers are particularly well suited. Since the total DNA of even a small virus can be several tens of thousands of bases long, discovering patterns, while often heuristic, is tedious and repetitive. The second level of discovery involves model building and theory formation, where the task is to develop an understanding of both the local and global structures of protein-production machinery and to determine how these structures relate to the functionality of the cell. This level is of great interest to the branch of computer science known as machine learning (a subfield of artificial-intelligence (AI) research).
The MOLGEN-II project is supported by NSF grant 83.10236.
ACM/IEEE-CS Joint Issue. © 1985 ACM 0001-0782/85/1100-1164 75¢
Communications of the ACM
The purpose of this article is to show how computers are currently being used to help biologists solve discovery problems on the first level and how they potentially can be used on the conceptually more difficult problems on the second level. The authors’ experience in these applications comes from their participation in the MOLGEN project, a research group begun in 1975 as a collaboration among computer scientists and molecular biologists at Stanford University. MOLGEN is a part of the Heuristic Programming Project (HPP), an organization led by Edward Feigenbaum of the computer science department with the goal of applying AI methodologies to real-world problems in science, medicine, and engineering. Current MOLGEN work focuses on model-building and theory-formation issues; previous research on experiment design led to the development of several useful tools for pattern recognition in DNA sequences.
This article is divided into three self-contained sections. The first provides a tutorial introduction to the science of molecular genetics to allow the computer-scientist reader to fully appreciate the problems of biological discovery. The second section illustrates the use of symbolic pattern-recognition tools in analyzing DNA sequences. The final section describes the MOLGEN research in scientific theory formation introduced in the previous paragraph.
BIOLOGICAL BASICS
A gene was originally defined as a purely metaphysical concept: the unit of heredity (for example, the trait for blue eyes versus brown eyes, or the trait for a wrinkled versus a smooth skin on a garden pea). However, molecular biologists discovered that a gene has an actual physical representation in the chromosomes, the hereditary material of cells. Every cell has one or more chromosomes, and each cell in any multicellular organism has a duplicate set of chromosomes. Genes are small, linearly arrayed regions of the chromosomes. Although chromosomes are highly coiled polymers complexed with a large array of proteins, the core of every chromosome is a continuous polymer of DNA. The DNA polymer, made up of only four building blocks, the nucleotide bases, carries in coded form all the information necessary for inheritance of specific traits, the genes. A gene is a region of the DNA molecule that encodes such a specific trait.
November 1985 Volume 28 Number 11
The Concept of Complementarity
DNA is a double-stranded heteropolymer and can be thought of symbolically as a continuous string of the four bases. The nucleotide base elements of the deoxyribose nucleotide molecules on one of the strands interact with their counterparts on the other strand in a precise and predictable way that is actually part of the copying mechanism. An A can only interact (through hydrogen-bond formation) with a T, and a G can only interact with a C. This simple relationship, called base pairing, totally defines the sequence of a second DNA strand if the sequence of the first strand is known. Thus, if a short segment of a DNA molecule has the sequence ATATAGCTCG, then the double-stranded molecule must look like this:
---ATATAGCTCG---   Strand 1
---TATATCGAGC---   Strand 2
When DNA is replicated, or copied, a new DNA strand is synthesized using each of the old strands as a template, that is, as a pattern to guide the formation of a new complementary strand. Given the rule of complementation, the sequences of the new strands are completely predictable.
old ---ATATAGCTCG---      new ---ATATAGCTCG---
new ---TATATCGAGC---      old ---TATATCGAGC---
The Physical Basis of Mutations
When the germ cells of an organism (the eggs or sperm) undergo cell division (multiplication), the DNA molecules in each cell are also copied. Although the copying mechanism that generates the new DNA strands is highly reliable, mistakes (mutations) do occur, usually at random. For example, when copying a sequence such as ATATAGCTCG one could end up with ATATACCTCG. Such a simple single base substitution, or point mutation, can affect genetic information (but not always; sometimes such mutations have no obvious effect and are considered neutral). Bases can also be deleted or inserted; short or long DNA segments can be lost or added; DNA segments can even be inverted. Such events are also mutations and usually (but not always) cause more drastic changes in heritable characteristics than do the simpler point mutations. In a formal sense, the smallest fundamental unit of heredity is a single nucleotide of DNA because it can alter a heritable trait. However, for the purposes of this discussion, we shall consider a gene to be a bit larger than that. We define a gene as any region of DNA that encodes a complete product or performs a specific function.
The Products of Genes
The products of the kinds of genes we are considering are proteins. Proteins are often enzymes (catalysts that convert one chemical compound to another) or are structural components of cells. An example of an enzyme would be one of the proteins in the pathway that enzymatically converts the simple amino acid tyrosine to the brown-skin pigment melanin. If the gene encoding the information to make this protein carries a mutation, then the protein will be absent or defective, and no melanin, or reduced amounts of melanin, will be made, leaving the individual with an albino skin color.
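The base-pairing rule and the single-base substitution described under The Concept of Complementarity and The Physical Basis of Mutations can be illustrated with a few lines of Python. This is only a sketch and is not part of the original article: the function names `complement`, `replicate`, and `point_mutations` are hypothetical, and the strands are written left to right exactly as in the diagrams, without regard to chemical orientation.

```python
# Watson-Crick pairing as stated in the text: A pairs with T, G pairs with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand):
    """Return the partner strand, base by base, in the same left-to-right order."""
    return "".join(PAIR[base] for base in strand)


def replicate(strand1, strand2):
    """Copy a duplex: each old strand serves as the template for a new partner.

    Returns two duplexes, each a (top, bottom) pair of strings.
    """
    return (strand1, complement(strand1)), (complement(strand2), strand2)


def point_mutations(original, copy):
    """List (position, old_base, new_base) for every substituted base (1-based)."""
    if len(original) != len(copy):
        raise ValueError("insertions and deletions are not handled in this sketch")
    return [(i + 1, a, b) for i, (a, b) in enumerate(zip(original, copy)) if a != b]


strand1 = "ATATAGCTCG"            # the example segment from the text
strand2 = complement(strand1)
print(strand2)                    # TATATCGAGC
print(replicate(strand1, strand2))
print(point_mutations(strand1, "ATATACCTCG"))   # [(6, 'G', 'C')]
```

Both daughter duplexes come out identical to the parent, which is the sense in which the new strands are "completely predictable"; the last line locates the G-to-C substitution used as the example of a point mutation.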
How Does a Gene Carry Protein Coding Information?
Genes are regions of DNA, and proteins are the products of genes. Proteins are built from a fundamental set of 20 amino acids, and DNA carries the amino-acid coding information. If a single nucleotide base of DNA coded 1 amino acid, only 4 amino acids could be accounted for because there are only 4 nucleotides in DNA. Only 16 would be accounted for by 2 nucleotides (4 × 4), whereas 64 kinds of amino acids can be determined by 3 nucleotides in sequence (4 × 4 × 4). Therefore, a code consisting of at least 3 nucleotides would be needed to specify all 20 amino acids. In fact, a group of 3 nucleotides in sequence does code for 1 amino acid, and this 3-nucleotide or triplet sequence is called a codon. Since there are 64 possibilities for triplet combinations of four bases and there are only 20 fundamental amino acids, the code is obviously redundant and is referred to as degenerate. Several combinations of three bases are synonyms. No triplet encodes more than 1 amino acid, but many amino acids are encoded by more than one triplet, some by as many as six. In addition, three of the triplets do not correspond to any amino acid. Accordingly, when any one of these three so-called nonsense combinations occurs, it acts as a termination signal and signifies the end of the encoded protein.
Transcription and Translation
The DNA itself does not directly act as a template for the protein decoding/synthesizing machinery. What happens, in fact, is that a complementary copy of one of the two strands of DNA is synthesized out of ribose nucleotides to generate an RNA copy of the gene in a process called transcription. This RNA copy is then decoded by the protein synthesizing machinery in a process called translation. Since this RNA carries the protein code, it is called messenger RNA (mRNA). Transcription is highly regulated and is probably the major mechanism that distinguishes one kind of cell in an organism from another. That is, the set of genes being transcribed in one cell distinguishes the kinds of proteins made by that cell from the proteins made by another cell: a muscle cell makes contractile proteins, and a red blood cell makes globin proteins. This differential gene activity is often controlled by regulation of transcription of one set of genes versus another.
Schematic Diagram of DNA
The double-stranded helical structure of DNA, as first presented by James Watson and Francis Crick in 1953.
(a) In the early 1950s James Watson and Francis Crick attempted to construct a model that would fit the information already known about DNA. By piecing together various data, they were able to show that the structure of DNA is a long, entwined double helix.
Sugar-phosphate backbone with the variable sequence of bases attached.
(b) DNA is a macromolecule made up of four kinds of bases attached to a sugar-phosphate backbone. The bases carry genetic information, and the sugar and phosphate groups perform a structural role. Although the molecular building blocks that go into the assembly of DNA are few and relatively simple (consisting of only four types, deoxyadenosine, deoxycytidine, deoxyguanosine, and deoxythymidine, usually abbreviated as A, C, G, and T), the molecules are polymers, and their combination adds large-scale complexity to any attempt to determine general patterns of organization. A can pair only with T, and G only with C. Thus the order of bases along one strand determines the order along the other. The strands run in opposite directions.
The chemical structure of a portion of one strand of DNA (labels: 5’ end, deoxythymidine, deoxycytidine, deoxyadenosine, deoxyguanosine, 3’ end).
(c) In the repetitive sugar-phosphate sequence that forms the backbone of the molecule, each phosphate group is attached to the 5’ carbon (the fifth carbon in the ring) of one deoxyribose sugar, and to the 3’ carbon (the third carbon in the ring) of the deoxyribose sugar in the adjacent nucleotide. Thus the chain has a 5’ end and a 3’ end.
Regulation of Transcription
The transcription of an mRNA is a very precise activity. The mRNA molecule always starts at one precise nucleotide and ends at another exact point. How does the cell know when to transcribe a particular mRNA, and equally important, how are the start and stop points determined? The underlying assumption of all molecular geneticists working in the field of gene regulation is that DNA sequences, in addition to carrying the well-understood amino-acid coding information, also carry the transcription-regulating information. What kind of information, besides the protein code, is embedded in the DNA? Among the kinds of signals presumed to exist are information telling the mRNA synthesis machinery that a gene is active; that the first nucleotide of the mRNA starts at a specific place; that the polymerization of the mRNA starting at that place proceeds linearly from only one of the two strands and moves in only one direction along the DNA; that the mRNA ends at a specific site; and that the mRNA synthesizing machinery should stop (or fall off the DNA) at that point.
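The mechanics just described (one template strand, one direction, a fixed start and a fixed end) can be sketched in Python. The sketch is not from the original article and rests on stated assumptions: the name `transcribe` and its `start` and `end` parameters are hypothetical, the template strand is assumed to be written 3’ to 5’ from left to right so that the mRNA comes out 5’ to 3’, and nothing here models how the cell actually finds the start and stop points.

```python
# RNA is built from ribose nucleotides and carries uracil (U) where DNA has T.
RNA_PAIR = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template, start, end):
    """Return the mRNA copied from template[start:end].

    `template` is the DNA strand being read, assumed to be written 3'->5'.
    `start` is the 0-based position of the first transcribed base and `end`
    is one past the last transcribed base (both hypothetical inputs).
    """
    if not 0 <= start < end <= len(template):
        raise ValueError("start and end must lie within the template")
    return "".join(RNA_PAIR[base] for base in template[start:end])


# Strand 2 of the earlier example serves as the template.
print(transcribe("TATATCGAGC", 0, 10))   # AUAUAGCUCG
print(transcribe("TATATCGAGC", 2, 8))    # AUAGCU
```

The full-length transcript has the same sequence as Strand 1 with U in place of T, which is why the mRNA is said to carry the protein code.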
Gene transcription is often highly modulated under the influence of complex factors in the environment of the cell (for example, hormones and metabolic substrates). In most cases, the regulatory signals that interact with the cellular and outside environment control transcription of the DNA. For the most part, those areas of DNA that control the rate and location of transcription have been found to lie upstream from the point of initiation of RNA transcription and upstream from the protein coding regions. In other words, these signal regions are themselves usually not transcribed into mRNA. A schematic drawing of a gene with some of these control landmarks is seen in Figure 1 (p. 1169).
The existence of these signal regions has been inferred from two major kinds of laboratory experiments. When mutant organisms are found that express a normal protein, but at abnormally low or high levels, direct structural analysis of the DNA has often revealed point mutations in upstream flanking sequences. The second kind of experiment involves the direct chemical modification of genes in the test tube using recombinant-DNA and genetic-engineering techniques applied to DNA fragments containing a gene.
The Genetic Code
ala   GCA GCC GCG GCU
arg   AGA AGG CGA CGC CGG CGU
asn   AAC AAU
asp   GAC GAU
cys   UGC UGU
glu   GAA GAG
gln   CAA CAG
gly   GGA GGC GGG GGU
his   CAC CAU
ile   AUA AUC AUU
leu   CUA CUC CUG CUU UUA UUG
lys   AAA AAG
met   AUG
phe   UUC UUU
pro   CCA CCC CCG CCU
ser   AGC AGU UCA UCC UCG UCU
thr   ACA ACC ACG ACU
trp   UGG
tyr   UAC UAU
val   GUA GUC GUG GUU
stop  UAA UAG UGA
The 20 amino acids are coded by 61 triplet codons, with 3 additional codons serving as stop signals. The codon triplets shown here are the ones that would appear in the mRNA molecule.
The Structure of a Protein
Proteins are built from amino acids. Shown here is the linear sequence of amino acids for a relatively small protein, the hormone insulin, one of the first proteins for which the primary structure was determined. The two chains are held together by disulfide bridges (S).
Laboratory Modification of Genes
Modified genes can carry induced point mutations or, more commonly, precisely defined deletions of segments of DNA flanking a protein coding region. These modified DNA molecules can be reintroduced into cells, and their transcription and modulation studied. An example of such an experiment showing the effect of a series of induced mutations is shown in Figure 2 (p. 1170), in which the hsp70 gene, a gene known to exist in the cells of many organisms including man, has been experimentally dissected by sequential deletions. The hsp70 gene has the peculiar property of not being active under normal circumstances, but it is highly transcribed almost immediately in response to noxious stimuli such as raising the temperature of the cells from the normal 37° C of a mammalian organism to 42° C.
By deleting DNA from one end of a large fragment, a point is reached at which the gene reintroduced into cells no longer responds to heat shock. As the deletions get closer and closer to the starting point of transcription, the ability of the gene to support the synthesis of any mRNA whatsoever is abruptly diminished. Simple experiments of this kind have allowed molecular geneticists to identify potential regulatory signals in the 5’ flanking region of many animal-cell genes.
Patterns in Promoters
There is enough similarity in the pattern of DNA sequences necessary for correct transcription of the hsp70 gene and other genes (as determined by direct experimental testing) to allow some general principles to emerge. First, the control region at the beginning of the gene, called the promoter, is made up of more than one DNA signal. Some of these signals, or closely related DNA sequences, are also present in most animal-cell genes studied to date. About 20-25 bases upstream from the transcription initiation site lies a TATA box, a short stretch of 6-8 bases with the sequence that gives it its name plus several additional As or Ts. Upstream a variable distance, there are often one or more CAAT boxes, also named for their resident sequence, CCAAT. Additional sites appear to be responsible for the tissue-specific mode of gene expression. For example, a highly specialized sequence upstream from the TATA box is required for the heat-shock response of the hsp70 gene. This can be directly demonstrated by chemically linking this upstream heat-shock sequence to a test gene that normally is unresponsive to heat stimulation. When such a chimeric gene is reintroduced into cells, it responds efficiently to heat.
Major Processes in Protein Synthesis (panels: RNA Transcription, RNA Translation; labels: RNA polymerase, protein)
RNA Transcription (left): An enzyme, RNA polymerase, opens up the DNA, and the nucleotides are assembled into RNA using one strand of DNA as a template. The polymerization starts at a specific place, proceeds linearly in only one direction (3’ to 5’ direction) along the DNA, and ends at a specific site. RNA is similar to DNA except for two differences in its nucleotides and one difference in its structure. Instead of deoxyribose, RNA contains ribose; instead of thymine, RNA contains uracil (U); and lastly RNA does not possess a regular helical structure and is usually single stranded. Adapted from Curtis, H. Biology. 4th ed. Worth, New York, 1983, p. 298.
FIGURE 1. A Schematic Drawing of a Gene (labels: potential control, start of transcription, mRNA coding segment, protein start, protein coding segment, terminator)
FIGURE 2. The Effect of a Series of Induced Mutations on the hsp70 Gene (labels: mRNA at 37° and 42°; needed for expression; place heat-shock segment in front of test gene)
(a) A gene under the control of a heat-responsive regulatory region can be introduced into mammalian cells. The gene enters the nuclear DNA and behaves normally. Sets of daughter cells (the new cells resulting from cell division) can be subjected to normal (37° C) or heat-shock temperature (42° C). By examining RNA from such cells by electrophoresis on gel slabs, only the RNA from the heat-shocked cells should contain a heat-shock related messenger RNA.
(b) Using this kind of mRNA assay, the gene bearing the heat-shock regulatory signals can be analyzed in structural detail. The DNA can be chemically modified by recombinant-DNA technology to delete small segments of the regulatory regions. Different modified molecules are then checked for their ability to engender a heat-shock mRNA response. By generating and testing a series of overlapping DNA modifications, the location of potential regulatory sequences can be inferred. A direct test would include the demonstration that the potential regulatory element confers a heat-shock response after it is inserted in front of an otherwise heat-shock nonresponsive gene.
A closer look at the relatively simple and short heat-shock promoter from the genes of a number of organisms has revealed a common DNA sequence pattern (see Table I). A general rule is that nucleotide sequences conserved in evolution among diverse species or present at homologous locations on many genes in the same organism represent functionally identical segments. Hence, the common pattern of heat-shock promoters observed among species as diverse as fruit flies, toads, and molds is likely to be a key element in the heat-shock regulatory machinery. The most striking thing about the pattern (see Table I) is the presence each time (or almost each time) of a spatially conserved arrangement of bases. Furthermore, these bases exhibit another feature common to DNA regulatory signals, dyad symmetry. The sequence C--GAA--TTC--G on one
strand reading left to right is identical to the sequence on the complementary strand reading right to left. In chemical terms this means that the two DNA strands have a region of stereochemical identity; that is, they are mirror images of each other (see Figure 3).
TABLE I. Heat-shock Element Sequences
Gene: Fruit fly hsp70, hsp68, hsp83, hsp22, hsp26; Toad hsp70, hsp30; Soybean hsp17; Mold DIRS-1
Heat-shock element sequences | Distance to TATA | TATA
ATAAAGAATAlTCIA6AA CTCEAfAAATTTCTCT66 TTCTCGTTfCTTCEAfAf CCCTCSAATfTTCfCfAA | 215 144 36 15 | TATAAATACAEECSC
6ACTffAATSTTCTfACC ATCTCCAATTTTCCCCTC | 45 12 | TATAAATACECCCCC
ATCCAGAA6CCTCYAGAA CTCTASAAGTTTCTASAG TTCTA6AGACTTCCASTT | 35 25 15 | TATAAAACCAfACGC
CCCCAfAAACTTCCAC66 CCfAACAAAATTC6AfAE T6CCCfTATTTTCTAfAT | 147 46 26 | TATAAATAECCACCS
CCCfAGAAGTTTCCTCTC | 97 | TATAAATAACCfCAC
TTCCffACTCTTCTAEAA | 13 | TATAAAAfCA6CfTC
CTCfAfAAAfCTCfC6AA CTCfCGAATCTTCCfCEA CTCfCGAAAGTTCTTCff CTCEfCAAACTTCfffTC | 204 194 139 72 | TATAAATACAfCfEG
TECCAEAAETTfCTACCA CTCffEAACfTCCCA6AA | 124 14 | TATAAAACTCCT666
ATCCCfAAACTTCTAETT 6TCCAEAATCTTTCTEAA TTTCAEAAAATTCTAETT CCCAAEfACTTTCTCfAA | 129 98 78 28 | TTTAAATACCCCATf
TTTTACAATCTTCTAEAA TTCTACAACATTCCAACA | 179 169 | TATAAATACTCECAC
Common pattern: C--GAA--TTC--G | | TATAAA
This table displays a number of nucleotide segments located in the putative control region of many heat-shock genes from four organisms. The heat-shock element, whose sequence was identified as being a control element by experiments of the type shown in Figure 2, is the fruit fly hsp70 segment closest to the TATA box. The other elements are all present one or more times near known heat-shock responsive genes, but it is not clear how many of these sequences function in the heat-shock response. Pattern analysis of this type leads to the identification of potential regulatory elements, but these must be examined by direct experimentation.
FIGURE 3. Stereochemical Images of Two Strands of DNA
C--GAA--TTC--G
G--CTT--AAG--C
Such segments with their two-fold axis of symmetry are often the site of binding for specific protein molecules that regulate the transcription of DNA in their vicinity. Exactly how such DNA binding proteins exert their influence is not understood, but models built by analogy with simpler bacterial systems where the data are more complete suggest that some binding proteins can interfere with the ability of the transcription machinery to polymerize RNA (repression), while others can augment that activity by boosting the polymerization machinery (activation), or even by displacing repressor molecules. In the case of the heat-shock promoter, a protein molecule with the expected binding and activating properties has been isolated from animal cells. The nature of this protein, as well as some of the few other known regulatory proteins, is that it is made up of two identical subunits arrayed in a symmetrically opposing orientation. The most plausible model to account for this organization relates to the ability of such proteins to bind to DNA strands exhibiting dyad symmetry. The
protein is capable of wrapping the stereochemical dyad in its symmetrical arms; hence the importance of recognizing dyad symmetries in DNA.
We have described but a few of the known types of DNA signals discovered to date. For the most part, these signals have been elucidated by a combination of experimental manipulation of the kind we have outlined, and an empirical, manual analysis of the DNA sequences implicated. However, as the library of known DNA sequences has grown from only a few thousand nucleotide base pairs to over five million base pairs, manual methods of analysis have grown increasingly cumbersome. The last decade has seen a great deal of research into the development of computational tools that can be used to recognize significant features on DNA sequences.
SEQUENCE-ANALYSIS TOOLS
DNA-Sequence Libraries
We have seen that the primary data of molecular biology are the DNA sequences themselves, the collection of As, Cs, Gs, and Ts that encode the information stored in genes. Determining that sequence from a given piece of DNA was a long and cumbersome process, taking on the order of man-years to elucidate a dozen or so base pairs. However, two new methodologies, known as Maxam-Gilbert and Coulson-Sanger sequencing, became practically useful in 1974. These two methods made possible the determination of 200-400 base pairs of DNA-sequence information in only a few hours. Scientists, using these sequencing methods and, often, specialized computer programs that aid in the assembly of these several hundred base-pair pieces into an entire genome, are now routinely sequencing complete viruses up to 50,000 base pairs in length. Two major, authenticated, and annotated collections of DNA-sequence data are now maintained: GENBANK, an NIH-supported collaboration between Los Alamos National Laboratory and BBN, and the EMBL database, collected and distributed by the European Molecular Biology Laboratory. Each database contains over four million base pairs of sequence information (much of it complementary as the two databases have begun active sharing of the sequence collection work load) in standardized, computer-readable form. Access to the data is through distribution of magnetic tapes and floppy disks, direct computer-to-computer and computer-to-terminal transfer over telephone lines, and computational resources such as BIONET, run by IntelliGenetics as an NIH-sponsored Research Resource, which provide access to both sequence-data and sequence-analysis programs for the nation’s academic molecular biologists.
GENBANK is a trademark of Los Alamos National Laboratory and BBN. BIONET is a trademark of IntelliGenetics.
Functionality
Although literally hundreds of computer programs have been written over the last decade to automate various tasks of DNA-sequence analysis, the types of common analyses are relatively few and well understood. (The reason for hundreds of programs relates to the lack of knowledge of existing work, diversity of machines and languages, and personal views on issues of pattern-matching algorithms and I/O behavior.) The major classes of functionality are the following:
• Translation and location of potential protein coding regions. This process takes the DNA sequence 3 base pairs at a time and produces the amino-acid sequence that would result from transcription of the DNA into RNA and from translation of the RNA into an amino-acid sequence. Normally this analysis must be done in six frames on a double-stranded DNA molecule, in each of three phases on each of the two strands of the DNA. A further sophistication of the analysis is to first look for the codon for the amino acid methionine, which initiates every protein, then determine if the nearest stop codon is at least 30 or 40 base pairs away, and only then translate the DNA. This eliminates many of the nonsense translations that result from a complete analysis.
• Inter- and intrasequence homology searching. This process compares a DNA sequence both to other DNA sequences and within itself to find repeated patterns of nucleotide bases. For example, locating precisely the same pattern of As, Gs, Cs, and Ts in front of 10 different genes would be highly significant evidence toward a possible control region. Finding identical sequences of DNA within the same gene is also a biologically “interesting” discovery. Two major complexities arise in homology searching. First, important homologies are not necessarily identical; often they differ by a few substitutions of one nucleotide for another or deletion of a few nucleotides. Second, in any string of several thousand letters in a four-letter alphabet, certain common patterns are bound to emerge. Evaluating the statistical rarity of these patterns (especially when they may be inexact matches) is essential to evaluation of significance.
• Inter- and intrasequence dyad symmetry searching. Dyad symmetries in a sequence indicate possible regions where control proteins may bind to a DNA molecule. In addition, it should be noted that DNA, in its double-stranded form, is quite rigidly locked into a double-helical shape. However, RNA is single-stranded and free to bend and twist in solution. Dyad symmetries in DNA point toward areas of inverse complementarity on the RNA; these are areas where portions of the RNA may “stick” to themselves and form structures called hairpin loops. Many biologists theorize that these structures also play a part in control of DNA replication, transcription, or translation. Like homologies, dyad symmetries may be partial. Statistical evaluation is important. In addition, when RNA loops are theorized, approximate free energy calculations of the hydrogen bonds holding the loops together may be made as a guide to the biological reality of the loops.
• Analysis of codon frequency, base composition, and dinucleotide frequency. Substantial redundancy exists in the genetic code. Different organisms or different genes within organisms seem to favor certain three-letter codons over others in nonrandom ways. Complete analysis of the distribution of these three-base patterns (in all three phases and both strands), as well as analysis of relative frequency of the four nucleotides taken individually and as pairs, is often requested by scientists searching for meaningful signals in DNA.
• Locating AT- or GC-rich regions. DNA molecules vary in the relative prevalence of AT and GC base pairs. Certain theories have been propounded about the meaning of “richness” of one over the other, and automated search for such regions (with the definition of richness under the control of the biologist) is considered a useful feature of sequence-analysis programs.
• Mapping of restriction enzyme sites. Crucial to the success of modern experimental molecular biology was the discovery of enzymes called restriction endonucleases, which were found to cut DNA selectively at patterns of four, five, or six bases (say the pattern ATTA or ATCGAT). Mapping of these sites on a molecule is considered automatic before manipulation of that molecule.
Some Illustrations of Sequence Analysis
The remainder of this section illustrates some common sequence-analysis operations. There are many programs used to perform these various tasks. We have chosen for our examples one of the most highly developed, the SEQ program of IntelliGenetics. Four DNA sequences from possible human oncogenes (genes that may cause cancer) are used to illustrate all but the final operation. HUMONCEJ and HUMONCT24 are the complete genes with control, coding, and unknown regions included; HUMONC1 and HUMONC2 are the coding regions only from the two genes. The restriction enzyme mapping operation is illustrated on a small, artificially constructed piece of DNA used to link larger molecules together (see Figures 4-9).
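The first class of functionality, translation in six frames with a search for methionine codons followed by a distant stop codon, can be sketched as follows. This Python code is an illustration only and is not the SEQ program; the function names are hypothetical, the codon table is the standard genetic code written in the DNA alphabet, a region that runs off the end of the sequence without meeting a stop codon is treated as open (an assumption), and the test sequence is the 216-base HUMONCEJ fragment printed in the translation figure.

```python
from itertools import product

# Standard genetic code in the DNA alphabet; '*' marks the three stop codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(dna):
    """Return the opposite strand, read in its own direction."""
    return "".join(PAIR[base] for base in reversed(dna))


def translate(dna, phase=0):
    """One-letter protein for the complete codons starting at offset `phase`."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(phase, len(dna) - 2, 3))


def six_frames(dna):
    """Translations in each of three phases on each of the two strands."""
    strands = {"+": dna, "-": reverse_complement(dna)}
    return {f"{sign}{phase + 1}": translate(strand, phase)
            for sign, strand in strands.items() for phase in range(3)}


def candidate_coding_regions(dna, min_length=30):
    """Find ATG codons whose nearest in-frame stop is at least min_length bases away.

    Returns (start, end, protein) tuples for the given strand; `end` is the
    position of the stop codon, or the end of the last complete codon if the
    sequence runs out first.
    """
    regions = []
    for phase in range(3):
        i = phase
        while i + 3 <= len(dna):
            if dna[i:i + 3] == "ATG":
                j = i
                while j + 3 <= len(dna) and CODON_TABLE[dna[j:j + 3]] != "*":
                    j += 3
                if j - i >= min_length:
                    regions.append((i, j, translate(dna[i:j])))
                i = j            # resume the scan at the stop codon
            else:
                i += 3
    return regions


HUMONCEJ_FRAGMENT = (
    "CCCGGGCCGCAGGCCCTTGAGGAGCGATGACGGAATATAAGCTGGTGGTGGTGG"
    "GCGCCGTCGGTGTGGGCAAGAGTGCGCTGACCATCCAGCTGATCCAGAACCATT"
    "TTGTGGACGAATACGACCCCACTATAGAGGTGAGCCTGCGCCGCCGTCCAGGTG"
    "CCAGCAGCTGCTGCGGGCGAGCCCAGGACACAGCCAGGATAGGGCTGGCTGCAG"
)

for name, protein in six_frames(HUMONCEJ_FRAGMENT).items():
    print(name, protein)
for start, end, protein in candidate_coding_regions(HUMONCEJ_FRAGMENT):
    print(start, end, protein)
```

On this fragment the only methionine codon lies in the third phase of the printed strand, and no stop codon follows it before the fragment ends, so a single open region is reported. Filtering on `min_length` is what suppresses the many short nonsense translations mentioned in the list above.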
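Pattern searching of the kind used for homologies, dyad symmetries, and restriction sites can be sketched in the same spirit. The Python code below is a hedged illustration and not any published algorithm: the function names are hypothetical, "-" stands for an unspecified base as in the heat-shock pattern C--GAA--TTC--G, the DNA string is artificial and built only for the demonstration, and the keys of the site table are placeholder labels rather than names of real enzymes. No statistical evaluation of significance is attempted.

```python
# '-' marks a position where any base is allowed; it pairs with itself.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G", "-": "-"}


def reverse_complement(pattern):
    """Sequence of the complementary strand, read right to left."""
    return "".join(PAIR[base] for base in reversed(pattern))


def is_dyad_symmetric(pattern):
    """True when the pattern reads the same on both strands."""
    return pattern == reverse_complement(pattern)


def mismatches(pattern, window):
    """Count positions where a specified pattern base differs from the window."""
    return sum(p != "-" and p != w for p, w in zip(pattern, window))


def find_sites(dna, pattern, max_mismatch=0):
    """0-based start of every window matching the pattern within max_mismatch."""
    width = len(pattern)
    return [i for i in range(len(dna) - width + 1)
            if mismatches(pattern, dna[i:i + width]) <= max_mismatch]


def site_map(dna, sites):
    """Map each labeled recognition pattern to the positions where it occurs."""
    return {label: find_sites(dna, pattern) for label, pattern in sites.items()}


HEAT_SHOCK_PATTERN = "C--GAA--TTC--G"
dna = "AACTAGAACATTCTAGATCGATCC"      # artificial example sequence

print(is_dyad_symmetric(HEAT_SHOCK_PATTERN))        # True
print(is_dyad_symmetric("ATCGAT"), is_dyad_symmetric("ATTA"))
print(find_sites(dna, HEAT_SHOCK_PATTERN))          # [2]
print(site_map(dna, {"site_ATTA": "ATTA", "site_ATCGAT": "ATCGAT"}))
```

Raising `max_mismatch` lets the same search pick up the partial homologies and partial dyad symmetries that the list above warns about, at the cost of more chance matches, which is exactly where statistical evaluation becomes necessary.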
SEQ is a trademark of IntelliGenetics. SEQ was originally written in the MOLGEN project at Stanford University and licensed to IntelliGenetics, where it has been significantly improved and enhanced over the last five years.
HUMONCEJ
The translated sequence is
                                 27                                  54
CCC GGG CCG CAG GCC CTT GAG GAG CGA TGA CGG AAT ATA AGC TGG TGG TGG TGG
Pro Gly Pro Gln Ala Leu Glu Glu Arg  .  Arg Asn Ile Ser Trp Trp Trp Trp
 Pro Gly Arg Arg Pro Leu Arg Ser Asp Asp Gly Ile  .  Ala Gly Gly Gly Gly
  Arg Ala Ala Gly Pro  .  Gly Ala MET Thr Glu Tyr Lys Leu Val Val Val Gly
                                 81                                 108
GCG CCG TCG GTG TGG GCA AGA GTG CGC TGA CCA TCC AGC TGA TCC AGA ACC ATT
Ala Pro Ser Val Trp Ala Arg Val Arg  .  Pro Ser Ser  .  Ser Arg Thr Ile
 Arg Arg Arg Cys Gly Gln Glu Cys Ala Asp His Pro Ala Asp Pro Glu Pro Phe
  Ala Val Gly Val Gly Lys Ser Ala Leu Thr Ile Gln Leu Ile Gln Asn His Phe
                                135                                 162
TTG TGG ACG AAT ACG ACC CCA CTA TAG AGG TGA GCC TGC GCC GCC GTC CAG GTG
Leu Trp Thr Asn Thr Thr Pro Leu  .  Arg  .  Ala Cys Ala Ala Val Gln Val
 Cys Gly Arg Ile Arg Pro His Tyr Arg Gly Glu Pro Ala Pro Pro Ser Arg Cys
  Val Asp Glu Tyr Asp Pro Thr Ile Glu Val Ser Leu Arg Arg Arg Pro Gly Ala
                                189                                 216
CCA GCA GCT GCT GCG GGC GAG CCC AGG ACA CAG CCA GGA TAG GGC TGG CTG CAG
Pro Ala Ala Ala Ala Gly Glu Pro Arg Thr Gln Pro Gly  .  Gly Trp Leu Gln
 Gln Gln Leu Leu Arg Ala Ser Pro Gly His Ser Gln Asp Arg Ala Gly Cys Ser
  Ser Ser Cys Cys Gly Arg Ala Gln Asp Thr Ala Arg Ile Gly Leu Ala Ala Ala
Here we have a translation in three phases of a section of one of the strands of a possible oncogene. Putative starting sites are indicated by “MET” and stop signals by “.”. Note the long open reading frame (possible coding region) that begins at position 27 and continues to the end of the sequence. The translation uses the three-letter code for amino acids; the SEQ program also allows translation in the alternative one-letter code.
FIGURE 4. Translation
November 1985 Volume 28 Number 11 Communications of the ACM 1173
HUMONC1 -- HUMONC2
The regions of homology in the two sequences are
   1 ATGACGGAATATAAGCTGGTGGTGGTGGGCGCCGTCGGTGTGGGCAAGAGTGCGCTGACC
   1 ATGACGGAATATAAGCTGGTGGTGGTGGGCGCCGTCGGTGTGGGCAAGAGTGCGCTGACC
                                                         ****   *
     ATCCAGCTGATCCAGAACCATTTTGTGGACGAATACGACCCCACTATAGAGGTGAGCCT-
     ATCCAGCTGATCCAGAACCATTTTGTGGACGAATACGACCCCACTATAGAGG-ATTCCTA
     *  ***  **  ***  **
     GCG-CCGC-CGT--CCAGGTG 136
     CCGGAAGCAGGTGGTCA-TTG 139
% = 88.652  P( 141, 125) = .000E+00  E = .000
        *  *   *  *   *  *   ***    *  *
 270 GGAGAGGAGGGGG-CAT-GAGGGGCATGAGAGGTACC 304
 122 GGA-AGCAGGTGGTCATTGATGGG---GAGACGTGCC 154
% = 70.270  P( 37, 26) = .195E-06  E = .062
       * *     *   * *  *
 275 GGAGGGGGCATGAG-GGGCATGA 296
  37 GGTGTGGGCAAGAGTGCGC-TGA 58
% = 73.913  P( 23, 17) = .306E-04  E = 9.757
This is a comparison between the two theorized coding regions of the possible oncogenes. Dashes within the sequences indicate possible deletions; asterisks above the two sequences indicate mismatches. The printout also shows percent of total match, the a priori probability (the P value) of the match at any given location on the sequence, and the total number of expected matches (the E value) of that likelihood or greater for the entire comparison. The E value provides heuristic guidance as to the potential biological significance of the match. E values greater than 1 normally indicate an “uninteresting” homology.
FIGURE 5. Homology Search
MOLGEN: APPLICATIONS OF AI TO MOLECULAR GENETICS
Historical Background
The initial aim of the MOLGEN project was to provide intelligent assistance to the scientist designing an experiment in a biological laboratory. The major AI issue was planning, that is, the development of a design for a synthetic or analytic experiment before the experiment was actually carried out. Two planning systems resulted: one based on a cognitive study of how humans design experiments; the other relying more on computer science theoretic grounds on how to optimize human performance. Both systems showed substantial promise for further development as practical tools. The MOLGEN project also made progress in knowledge representation (how to effectively store diverse forms of biological knowledge in an intelligent computer system) and in knowledge acquisition (how to get that knowledge from human scientists in the first place). This research resulted in the development of a knowledge-engineering tool called the Unit System and formed the basis for KEE, a fully supported and documented commercial tool that is being used in current MOLGEN work (see below). A basic lesson learned during research in the HPP is that relatively simple reasoning or inference methods could be most effectively applied to problem solving when they were accompanied by a large amount of diverse expert knowledge. For this reason, projects in the HPP usually go through a life cycle of early emphasis on AI research, followed by a gradual transition to
the particular domain of real-world expertise involved in the project at hand. In 1982 we decided that MOLGEN had reached a state of development where the basic research goals in planning and design had been satisfied. Therefore, a search was begun for new problems of interest within the molecular biology domain. This search led to the second level of discovery problems discussed in this article, the task of scientific theory formation.
KEE is a trademark of IntelliCorp.
Research in Scientific Theory Formation
MOLGEN research in scientific discovery consists of several major phases. The first phase, which we have recently completed, was a detailed study of the reasoning processes used by Charles Yanofsky and colleagues during a 15-year endeavor to discover the regulatory processes operating in the tryptophan (trp) operon system of E. coli. The second phase, now under way, is to build a knowledge base that adequately represents the currently accepted theory of the biological system, adequacy here referring to the ability to explain laboratory observations and to predict logical consequences resulting from the theory. The third phase, now in the planning stage, will involve creating the reasoning
mechanisms that permit theory modification, extension, and occasionally complete re-creation, all of this still within the context of the Yanofsky trp-operon system. Finally, we hope to generalize and verify our results by applying the theory-formation system to a different biological system for which a verified regulatory theory is yet unknown.
Initial Problem Analysis
The first six months of the MOLGEN discovery project were spent in attempting to understand the scientific reasoning that went into building the theory of genetic regulation in the trp-operon system. This study involved both extensive literature research and personal interviews with most of the key scientists involved in the 15-year project. Yanofsky’s group has been uncommonly thorough in documenting both the methodology and the results of their project: our literature study involved the detailed analysis of over 50 scientific papers in refereed journals. Dozens of hours were spent questioning Yanofsky about speculation, insights, and blind alleys that occurred during the project. In addition, complementary and sometimes differing points of view were obtained from Imamoto, Hiraga, Lee, and Zurawski, among others.
To review the history of the trp-operon project very briefly, Yanofsky’s group started with the view that the genes that code for proteins in the tryptophan biosynthetic pathway (the proteins that are used by a cell to synthesize the amino acid tryptophan) were controlled by classic Jacob-Monod regulation, a model based on the regulatory mechanism for the lac operon of E. coli (Figure 10, p. 1178). A well-defined promoter and operator were also found preceding the genes in the E. coli trp operon, and a repressor protein was identified. However, over the course of studying the behavior of the system, various anomalies were discovered that could not be explained
HUMONCT24
The regions of dyad symmetry are
3414 CCGCCCCTGCCGGTCTCC-TGGCCTGCG 3440
     |||||| ||  :|||:|| | :||||||
3383 GGCGGG-ACCCTCAGGGGGAGTGGACGC 3357
% = 71.429  P( 28, 20) = .939E-04  E = 1.647  G = -37.300
3480 GG-ACATGG-AGG-TGCCGGA-TGCAGG-AAGGAGG 3510
     || ||| || ||| |||  || :|||||  |||||:
3461 CCGTGTTCCCTCCGACG-ACTGGCGTCCGGTCCTCT 3427
% = 72.222  P( 36, 26) = .101E-04  E = .178  G = -33.300
3487 GAGG-TGCCGGA-TGCAGG-AAGGAGG--TGCAG 3515
     |||| |||  || :|||||  |||||:   ||||
3452 CTCCGACG-ACTGGCGTCCGGTCCTCTGGCCGTC 3420
% = 70.588  P( 34, 24) = .121E-04  E = .212  G = -30.037
3494 CGGATGCAGGAAGG-AGG-TGCAGACGGAAGG--AGGAG 3528
     |||  |||  |||| ||| ||| ||| | |||  |||||
3464 GCC-CCGT-GTTCCCTCCGACGACTGGCGTCCGGTCCTC 3428
% = 71.795  P( 39, 28) = .423E-05  E = .074  G = -36.800
3614 TCCCAGGGAGGCTGTGC-ACAGACTGTCTTGAACATCC 3650
     |||| ||| |:||| ||  ||| | |:|:|   | |||
3587 AGGGCCCCACTGACCCGAGGTC-GTCGGGA-AGGAAGG 3552
% = 71.053  P( 38, 27) = .587E-05  E = .103  G = -17.600
These are the regions of intrasequence homology in one of the complete possible oncogenes. Bars between the two sequences show potential base pairing by hydrogen bonds (A-T or G-C); a colon shows G-T pairing that is energetically very slightly favorable. In addition to the P and E values from the homology search, a G value shows a rough estimation of free energy contribution of the possible RNA hairpin loop. G values that are -14 or less indicate possibly stable structures.
FIGURE 6. Dyad Symmetry Search
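The pairing marks in a dyad-symmetry listing can be derived mechanically from the two aligned strands. The sketch below is an illustration in Python, not the SEQ algorithm: the function name `pairing_line` is an assumption, and it only classifies each aligned column as a Watson-Crick pair (bar), a G-T pair (colon), or unpaired (blank). It makes no attempt to search for the symmetric regions or to estimate the P, E, or G values.

```python
# Column-by-column classification of base pairing between two aligned strands.
WATSON_CRICK = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}
G_T_PAIR = {("G", "T"), ("T", "G")}


def pairing_line(top, bottom):
    """Return a marker string: '|' for A-T or G-C, ':' for G-T, ' ' otherwise.

    Both strings must already be aligned (gaps written as '-') and the
    bottom strand must be written antiparallel to the top strand.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned strands must have the same length")
    marks = []
    for a, b in zip(top.upper(), bottom.upper()):
        if (a, b) in WATSON_CRICK:
            marks.append("|")
        elif (a, b) in G_T_PAIR:
            marks.append(":")
        else:
            marks.append(" ")  # mismatch or gap
    return "".join(marks)


if __name__ == "__main__":
    # First region listed in Figure 6 (positions 3414-3440 against 3383-3357).
    top = "CCGCCCCTGCCGGTCTCC-TGGCCTGCG"
    bottom = "GGCGGG-ACCCTCAGGGGGAGTGGACGC"
    marks = pairing_line(top, bottom)
    print(top)
    print(marks)
    print(bottom)
    print("Watson-Crick pairs:", marks.count("|"), " G-T pairs:", marks.count(":"))
```

For the first region of Figure 6 this yields 20 Watson-Crick pairs out of 28 aligned columns, which agrees with the P( 28, 20) entry printed for that region.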
HUMONC2
The codon frequency phased from position 1
[Table of counts and percentages for each of the 64 codons; only the rows listed here remain legible.]
CCT-Pro  2 ( 1.1)   CCC-Pro  4 ( 2.1)   CCA-Pro  0 ( 0.0)   CCG-Pro  0 ( 0.0)
ACT-Thr  2 ( 1.1)   ACC-Thr  6 ( 3.2)   ACA-Thr  0 ( 0.0)   ACG-Thr  3 ( 1.6)
GCT-Ala  2 ( 1.1)   GCC-Ala  7 ( 3.7)   GCA-Ala  1 ( 0.5)   GCG-Ala  1 ( 0.5)
GAT-Asp  6 ( 3.2)   GAC-Asp  9 ( 4.7)   GAA-Glu  3 ( 1.6)   GAG-Glu 11 ( 5.8)
TGT-Cys  3 ( 1.6)   TGC-Cys  3 ( 1.6)   TGA- .   1 ( 0.5)   TGG-Trp  0 ( 0.0)
GGT-Gly  1 ( 0.5)   GGC-Gly  7 ( 3.7)   GGA-Gly  1 ( 0.5)   GGG-Gly  3 ( 1.6)
This is a codon-frequency table of the coding region of one of the possible oncogenes (shown for the phase believed to be the coding phase). Note that some of the distributions seem close to random (say the choice of codons for the amino-acid Serine (Ser)), but some seem very nonrandom (look at Valine (Val) or Leucine (Leu)).
FIGURE 7. Codon-Frequency Determination
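A codon-frequency table of this kind can be produced by splitting the sequence into three-base words starting at the chosen phase and tallying them. The following Python sketch shows the idea under simple assumptions: the function name `codon_frequencies` is illustrative, the phase is given as a 0-based offset (so phase 0 corresponds to "phased from position 1"), and percentages are rounded to one decimal place as in the figure. It is not the SEQ implementation.

```python
from collections import Counter


def codon_frequencies(sequence, phase=0):
    """Count codons read in one phase.

    Returns {codon: (count, percent_of_all_codons_in_this_phase)}.
    `phase` is a 0-based offset of 0, 1, or 2 from the start of the sequence.
    """
    seq = sequence.upper()
    # Step through the sequence three bases at a time, dropping any
    # incomplete codon at the end.
    codons = [seq[i:i + 3] for i in range(phase, len(seq) - 2, 3)]
    counts = Counter(codons)
    total = len(codons)
    return {
        codon: (n, round(100.0 * n / total, 1)) for codon, n in counts.items()
    }


if __name__ == "__main__":
    example = "ATGGTGGTGTGA"  # made-up 12-base sequence, for illustration only
    for phase in range(3):
        print("phase", phase, codon_frequencies(example, phase))
```

Running the same function on the reverse complement covers the other strand, which gives the "all three phases and both strands" analysis mentioned earlier in the article.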
by even severe modification of the Jacob-Monod model. These experimental discoveries led to the proposing, testing, and evolution of a new regulatory model. Yanofsky concluded that transcription of the trp operon was regulated by a controlled termination site called an attenuator that is located between the operator and the gene for the first enzyme in the pathway. Attenuation, as the process is called, is a radically new regulatory mechanism in two major respects. First, secondary structure (hairpin loops) on the messenger RNA acts to control when transcription is stopped. Certain loops shape the RNA so that the polymerase “falls off.” Second, it involves a coordinated interaction between transcription and translation. The ribosome, previously thought only to function as the machinery used by a cell to produce an amino-acid sequence from an RNA template, functions to “decide” which loops form in the RNA by where it “stalls” on the molecule. (See Figure 11, p. 1179, for more details.)
Our analysis of the Yanofsky group’s research had two major goals: to collect the knowledge of biological structures and techniques studied and employed during the research, and to elucidate something of the reasoning process used by scientists during the theory-formation process. We believe that the first goal has been accomplished; this will be verified as we complete our simulation knowledge-base building task described below. We have also made substantial progress in our study
of how humans perform the task of scientific theory formation. There is concrete evidence of at least four types of reasoning:
1. Data-driven reasoning. The process by which empirical evidence directly leads to extensions and modifications in a regulatory theory. This form of reasoning was especially evident in the early stages of the trp-operon project when Yanofsky began by considering Jacob-Monod regulation in the lac operon of E. coli.
2. Theory-driven reasoning. The process by which the growing theory itself suggests logical further extensions to the theory. For example, when it was conclusively shown that the leader region of the operon had a definite regulatory function, the logical conclusion was that a regulatory protein binding site had to exist in that region. This led to a great deal of experimental work to find the site.
3. Analogy to other biological systems. The process by which related work on other biosynthetic operons contributes to developing the regulatory theory. For example, parallel work on the histidine operon (another operon for the biosynthesis of amino acids in E. coli) in Kanai’s laboratory both confirmed parts of the Yanofsky attenuation theory and suggested extensions to that theory.
4. Analogy to distantly related systems. The process by which seemingly unrelated systems can help in the theory-construction process. For example, some find it helpful to think of a DNA strand as a railroad track and DNA polymerase as a railroad engine moving down that track. Regulatory theories can be proposed by imagining ways in which the railroad engine can be stopped, diverted, or slowed down. Although this particular analogy sounds a bit frivolous when applied to regulatory genetics, it was interesting to note that Alexander Rich of MIT also found it useful when discussing control of gene transcription: “It’s as though the DNA is a railroad track and the polymerase zips along it until it finds a promoter . . .” (Z-DNA moves toward real biology. Science 222, 4623 (Nov. 4, 1983), 496).
We believe that it is significant that the various scientists involved in the trp-operon project made different uses of the reasoning methods just discussed. Yanofsky himself maintained a global view of the entire emerging system and tended to be very theory driven in his work. Other scientists (graduate students and postdoctoral scholars) tended to concentrate on much narrower parts of the system and were more data driven in their reasoning. The use of analogy was a very personal matter, both in choice of analogies and in the role they played.
Our overall conclusions from this study of theory formation are that, first of all, the basic inference processes are indeed understandable and potentially simulatable within a computational system, and, second, a problem-solving architecture that provides for multiple forms of reasoning, each contributing to an overall growing solution, is best for our initial consideration. The blackboard architecture has been developed by AI researchers for just this sort of situation, where contributions from multiple sources of knowledge are used jointly to solve a complex problem (Figure 12, p. 1180).
HUMONCT24 LIMITS: 1 1669
The CG rich regions are
      1610       1620       1630       1640       1650
GGGCCCTCCT TGGCAGGTGG GGCAGGAGAC CCTGTAGGAG GACCCCGGGC
      1660
CGCAGGCCCC TGAGGAGCG
HUMONC2
The CG rich regions are
       510        520        530        540        550
GCTGCGGAAG CTGAACCCTC CTGATGAGAG TGGCCCCGGC TGCATGAGCT
       560        570
GCAAGTGTGT GCTCTCCTGA
This figure examines both the noncoding (from bases 1 to 1669 of HUMONCT24) and coding regions (HUMONC2) of a possible oncogene for CG richness. The biologist has defined richness to mean at least 75 percent CG composition for at least eight bases in a row. Note that the very beginning of the gene (bases 1 to 150) and the region immediately in front of the coding region (bases 1610 to 1669) are very heavily GC rich. The gene itself has only a few regions of moderate GC richness.
FIGURE 8. Location of AT- and GC-Rich Regions
M13MP5-LINKER  Linear  Length = 78
 1 ATGACCATGATTACGAATTCCGGA 24
25 ATTCCGGAATTCCGGAATTCCGGA 48
49 ATTCCCCAAGCTTGGGAATTCCGGAATTCA 78
Sites marked on the map, by position:
NlaIII: 10
HinfIII: 14
EcoRI: 16, 24, 32, 40, 48, 65, 73
EcoRI': 18, 26, 34, 42, 50, 67, 75
HpaII: 21, 29, 37, 45, 70
HindIII: 57
AluI: 59
This figure shows the restriction enzyme map of a small artificial linker molecule used to connect two larger pieces of DNA. The enzyme symbols (EcoRI, for example) point to the place on the molecule where it would be cut by the enzyme.
FIGURE 9. Restriction Mapping
[Figure 10 diagram. Labels: CAP SITE, PROMOTER, RNA POLYMERASE ENTRY SITE, OPERATOR, REPRESSOR SITE, GENES CODING FOR PROTEINS, DIRECTION OF TRANSCRIPTION, RNA POLYMERASE, REPRESSOR, OPERATOR “OFF”.]
The Jacob-Monod model states that transcription, the step in protein synthesis in which a messenger RNA is built from a DNA template, begins when an enzyme called polymerase binds to a specific spot in the front of the gene called a promoter. Another molecule called a repressor, which is made elsewhere in the cell and is activated by an excess of the end product of the gene (in the Jacob-Monod model, lac), can bind to a nearby site on the DNA called an operator. If it does so, it blocks the polymerase from binding to the promoter, thereby stopping the protein-producing machinery at the very first step.
FIGURE 10. The lac Operon of E. coli
[Figure 11 diagram. Panel A: High tryptophan level (“Formation of this stem and loop results in the termination of transcription”). Panel B: Low tryptophan level (“Ribosome is stalled”).]
When tryptophan is abundant (A), the leader region (segment 1) of the trp mRNA is fully translated. Segment 2 interacts with the ribosome, which enables segments 3 and 4 to base pair. This base-paired region somehow signals RNA polymerase to terminate transcription. In contrast, when tryptophan is scarce (B), segments 3 and 4 do not interact because the ribosome is stalled at the trp codons of segment 1. Segment 2 interacts with 3 instead of being drawn into the ribosome, and so segments 3 and 4 cannot pair. Consequently, transcription continues. (After D.L. Oxender, G. Zurawski, and C. Yanofsky. Proc. Natl. Acad. Sci. 76 (1979), 5524.) Adapted from Stryer, L. Biochemistry. 2nd ed. W.H. Freeman and Co., San Francisco, Calif., 1981, p. 677.
FIGURE 11. Model for Attenuation in the E. coli trp Operon
Overall System Architecture
In simplest form, the MOLGEN scientific discovery system can be thought of as containing two components: a performance element and a learning element. The performance element includes a computational model of how nature operates in the biological system being studied. In the case of the trp-operon work, it represents the current best view on the mechanism of genetic regulation. The model encompasses both structural knowledge (factual information about each of the entities important to regulation) and functional knowledge (how the various entities affect each other and the system as a whole). The structural and functional
knowledge taken together form the knowledge base of the trp-operon system. In addition, the performance element includes inference methods that allow the knowledge base to be used to simulate the activity of the trp-operon system in response to various internal and external conditions. The learning element of the discovery system evaluates how well the performance element can be used to model the actual biological system of concern. It compares the result of simulations to laboratory data and decides when and how to modify the performance-element knowledge base (perhaps using the blackboard architecture previously discussed). A change to the knowledge base, in the form of an addition, deletion, or alteration to either the structural or functional knowledge, can be said to be a discovery. The bulk of our work so far has concentrated on the performance element of the MOLGEN discovery system. The major computational problems we are attempting to resolve are in qualitative simulation (where the goal is to predict generic behavior and trends rather than numeric quantities) and in making sure that the
THE “PURE” BLACKBOARD PARADIGM
[Diagram: several KNOWLEDGE SOURCE boxes, each connected to the central BLACKBOARD.]
Blackboard: the global database; all problem-solving state data.
Knowledge Sources (KS’s): data-triggered agents; respond to changes in blackboard; may include rules, procedures, tables, etc.; modify blackboard state; may do I/O; both domain and control knowledge KS’s.
Protocols: KS’s modify only the blackboard; only KS’s modify the blackboard; KS’s are “procedurally” independent.
FIGURE 12. The “Pure” Blackboard Paradigm
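The protocol summarized in Figure 12 can be illustrated with a very small control loop. The Python sketch below is a toy under stated assumptions, not the MOLGEN design: the class names `Blackboard` and `KnowledgeSource`, the `run` function, and the two example knowledge sources (with their entries about transcription) are all hypothetical. It shows only the essentials: knowledge sources are triggered by changes to the blackboard, they communicate solely by posting to it, and the loop stops when no new changes appear.

```python
class Blackboard:
    """Global store of problem-solving state, with a record of recent changes."""

    def __init__(self):
        self.state = {}
        self.changed = set()

    def post(self, key, value):
        # Record a change only when the entry is new or different.
        if self.state.get(key) != value:
            self.state[key] = value
            self.changed.add(key)


class KnowledgeSource:
    """A data-triggered agent: runs `action` when any trigger key changes."""

    def __init__(self, name, triggers, action):
        self.name = name
        self.triggers = set(triggers)
        self.action = action


def run(blackboard, sources, max_cycles=100):
    """Fire triggered knowledge sources until the blackboard stops changing."""
    for _ in range(max_cycles):
        pending, blackboard.changed = blackboard.changed, set()
        if not pending:
            break
        for source in sources:
            if source.triggers & pending:
                source.action(blackboard)
    return blackboard.state


# Two hypothetical knowledge sources, loosely modeled on the data-driven and
# theory-driven reasoning styles described in the text.
def data_driven(bb):
    if bb.state.get("observation") == "transcription stops early":
        bb.post("hypothesis", "termination site in leader region")


def theory_driven(bb):
    if "hypothesis" in bb.state:
        bb.post("prediction", "regulatory site should exist in leader region")


if __name__ == "__main__":
    bb = Blackboard()
    bb.post("observation", "transcription stops early")
    sources = [
        KnowledgeSource("data", ["observation"], data_driven),
        KnowledgeSource("theory", ["hypothesis"], theory_driven),
    ]
    print(run(bb, sources))
```

Because the knowledge sources never call one another, a new form of reasoning (analogy, for instance) could be added as another `KnowledgeSource` without touching the existing ones, which is the property the "procedurally independent" protocol is meant to secure.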
knowledge base is “transparent” enough so that both structural and functional information can be “understood” and modified by the learning element.
Knowledge-Base Construction
The trp-operon system knowledge base needs to contain at least the following classes of knowledge:
• knowledge about transcriptional and translational mechanisms and objects (operators, promoters, repressors, ribosomes, etc.);
• knowledge about the trp operon in particular (specific DNA structures in the operon, enzymes found in the system, etc.);
• knowledge about the laboratory methods used to discover information about the trp-operon system (nutrition experiments, mutation experiments, sequencing experiments, etc.);
• heuristics relating to regulation, biology in general, and several potential analog systems (including, for example, the railroad tracks analogy discussed earlier).
We have completed a substantial portion of the construction of the knowledge base. The work has been done using the KEE System, a product of IntelliCorp, on the Xerox 1108 Scientific Information Processor. KEE was selected as the best currently available frame-based, hierarchical, object-oriented knowledge acquisition and representation tool. Each of these three concepts will be discussed below. The Xerox 1108 is a high-performance scientific workstation with a large (over 2 MB) main memory, a microcoded processor optimized to the Interlisp programming language, and a double-width bit-mapped display.
Three major paradigms have emerged as representational frameworks for knowledge-based systems: logic based, rule based, and frame based. Logic-based systems acquire and store information as statements in the predicate calculus. They are especially useful when knowledge is easily formalizable, and relatively consistent and complete. Rule-based systems acquire and store knowledge as production (IF-THEN) rules. They are especially useful when the bulk of the knowledge is
heuristic and procedural. Frame-based systems acquire and store knowledge within blocks called frames that each represent an entity (either structural or functional) in the knowledge base. A given frame can have many different attributes (commonly called slots). Frame-based systems are especially useful when the knowledge is a mix of factual and procedural information, as is the case of our work in scientific discovery.
Frames are commonly organized into a hierarchy of classes, subclasses, and individual entities in the knowledge base. The major power of the hierarchy comes from the concept of inheritance. This means that attributes (slots) of a frame are passed down to its children, either directly in the form of values for that attribute or indirectly in the form of knowledge about what constitutes “legal” values for the attribute. For example, consider the small knowledge base shown below:
ANIMALS
    FISH
        SHARKS
        TROUT
    MAMMALS
        DOGS
            BOWSER
        ELEPHANTS
The ANIMALS frame has attributes like HEIGHT, WEIGHT, SKIN COVERING, and DOMESTICITY. None of these attributes are “filled in” with actual values (since the class of entities called ANIMALS does not have fixed values for those attributes), but certain restrictions on the values are known. For example, it may be defined that WEIGHT is a numerical quantity measured as a positive number of grams and that DOMESTICITY is either TAME or WILD. The MAMMALS unit inherits these attributes and might assign specific values, further narrow given restrictions (for example, by stating that all MAMMALS have WEIGHTS greater than 200 grams), or define new attributes common only to MAMMALS (say a CHILD-GESTATION-TIME attribute). This process continues through the hierarchy until an actual instance of an entity is defined, at which time all of the attributes are given specific values. So the BOWSER frame will have a DOMESTICITY of TAME, a WEIGHT of 15,050 grams, etc.
In real-world knowledge bases, inheritance becomes considerably more complex than the situation just described. Restrictions sometimes become only guidelines for values (e.g., a penguin is a bird except it cannot fly). Sometimes it is useful for slots to have several values up to some maximum number of values. Finally, the ability to manage a knowledge-base organization where a frame can have several parents adds a great deal of
expressive power since an entity can be viewed from several different perspectives in several different contexts. The following figures (Figures 13-15, pp. 1182-1185) will illustrate some of these features in the KEE trp-operon knowledge base.
Functional Knowledge
We previously noted that a knowledge base typically contains both factual and functional information, in other words, both “what” and “how” knowledge. KEE is an object-oriented knowledge representation system; this means that to the system user both types of knowledge are stored and retrieved in an identical manner: as the values of slots within units. For example, take the slot we saw above, the switch state of a molecular switch. For a given molecular switch, the person constructing the knowledge base might specify that it was always “on.” But that would not be a very sensible or flexible way to define the property. It is much more interesting to define the switch state in terms of its functional relationship to the current molecular environment.
The functional knowledge can be specified either as a procedure in the LISP programming language or within an English-like rule language supplied with KEE. In addition, we are constructing a biology-specific process language that will be translated into the more general KEE rule language. Whatever the form of the functional knowledge, it can then be “attached” to a given slot. In the case of switch state, a set of rules relating switch state to cofactor state, switch structural integrity, and the like forms this attached piece of functional knowledge. When the user asks KEE to retrieve the switch state of a particular molecular switch, if that value is not explicitly resident in the switch-state slot, the system notes the attached procedure and attempts to determine the value of switch state given the current state of the knowledge base. A full discussion of functional knowledge is beyond the scope of this overview article. We suggest that interested readers peruse the Fikes and Kehler article referenced in the Further Readings section.
Simulation of the trp-Operon System
Simulation of the trp-operon system is a simple process once the knowledge base is complete and accurate. Two different inference methods, forward and backward chaining, are employed. In forward chaining, a fact is stated, say, that an experiment has found a destructive mutation in the promoter region of the operon, and the system attempts to determine the ramifications of that fact. It does so by finding all rules or other procedural knowledge that include the state of the promoter in the conditional parts of their reasoning. That knowledge is applied, possibly resulting in further changes to the knowledge base. The system recurs with these new facts driving forward inference. The cycle continues until no new implications can be found. The state of the knowledge base after this process represents the system’s prediction of the effects of a destroyed
promoter region. Forward chaining can be viewed as “data-directed” reasoning. Backward chaining occurs when a question like “What could cause unregulated RNA synthesis to occur?” is asked. The system looks at the conclusion portions of all functional knowledge to find where that inference could possibly be made. If it finds any, then it checks the conditional portions of the knowledge to see if they are already true. For conditionals whose value is not yet known, it asks what would make those values true and recurs. This cycle continues until all necessary conditions have been determined or it has been shown that the proposed state (in this case unregulated RNA synthesis) could never occur in the current model. Backward chaining can be viewed as “theory-directed” reasoning.
[Figure 13 diagram: hierarchy of units including MOLECULAR.SWITCHES, SWITCH.BINDING.SITES, COFACTORS, POLYPEPTIDES, PROTEIN.BINDING.SITES, SPACERS, and MISC.MOLECULES, with instances such as TRP-R.POLYPEPTIDE.1, TRP-R.OPERATOR.BINDING.SITE.1, TRP-R.TRP.BINDING.SITE.1, and TRP.SPACER.1.]
Each word in this diagram represents a single frame (also called a unit in KEE) in the tryptophan-operon knowledge base. The most general units are on the left of the diagram; the most specific ones on the right. A solid line connecting two units, for example, from PROTEINS to ENZYMES, indicates a class-subclass relationship. A dotted line indicates a prototype-instance relationship; for example, TRP-R.1 is a specific model of the trp repressor created during a single simulation run. Note that units can have more than one parent in the knowledge base; this is very important as it allows individual entities to be viewed in several different perspectives. For example, OPERATORS are both DNA.SEGMENTS (viewing them in their structural sense as pieces of DNA) and MOLECULAR.SWITCHES (viewing them in their functional sense as control sites that can be used to turn a process on or off).
FIGURE 13. The Current Hierarchy of Part of the trp-Operon Knowledge Base
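The two inference methods described in the text can be sketched over a small set of IF-THEN rules. The Python below is a minimal illustration, not the KEE rule system: the function names, the rule format (a set of conditions paired with one conclusion), and the four example rules about promoters and repressors are assumptions made up for this sketch and are far simpler than the trp-operon knowledge base.

```python
# Each rule is (set of conditions, conclusion). These rules are invented for
# illustration only.
RULES = [
    (frozenset({"promoter destroyed"}), "polymerase cannot bind"),
    (frozenset({"polymerase cannot bind"}), "no transcription"),
    (frozenset({"repressor absent"}), "operator unblocked"),
    (frozenset({"operator unblocked", "promoter intact"}),
     "unregulated RNA synthesis"),
]


def forward_chain(facts, rules):
    """Data-directed: apply rules until no new conclusions can be added."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


def backward_chain(goal, facts, rules, seen=frozenset()):
    """Theory-directed: can `goal` be established from `facts`?"""
    if goal in facts:
        return True
    if goal in seen:  # avoid looping on circular rules
        return False
    seen = seen | {goal}
    # Try every rule that concludes the goal; all of its conditions must hold.
    return any(
        all(backward_chain(c, facts, rules, seen) for c in conditions)
        for conditions, conclusion in rules
        if conclusion == goal
    )


if __name__ == "__main__":
    print(forward_chain({"promoter destroyed"}, RULES))
    print(backward_chain("unregulated RNA synthesis",
                         {"repressor absent", "promoter intact"}, RULES))
```

Forward chaining here returns the full set of facts implied by the starting observation, which corresponds to the predicted state of the knowledge base; backward chaining answers only whether the queried state is reachable, and a fuller version would also report which conditions remain to be established.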
Further Research
The preceding has described our efforts to construct the performance element of our discovery system. When
we are satisfied with its ability to accurately model Jacob-Monod repression in the trp operon, we will proceed with development of the learning element of the system. To accomplish this, we intend to pursue the following specific tasks:
1. Build an interface that allows the simulator to be used to explain observations that are indeed explainable without changes to the current model. For example, “I have observed a mutation that causes constitutive (uncontrolled) production of tryptophan. How can that be explained within the Jacob-Monod model?”
2. Begin to recognize when observations are “interesting.” Interesting here has one of the following broad meanings:
• A seemingly direct contradiction to the existing theory.
• A statistically rare occurrence (one that is understandable by the current theory, but one that should not occur very often).
Unit: MOLECULAR.SWITCHES in knowledge base TRP-SIM.1
Created by KARP on 3-May-85 16:12:46
Modified by FRIEDLAND on 27-Jun-85 14:19:44
Superclasses: FUNCTIONAL.ENTITIES
Subclasses: REGULONS, PROMOTERS, GENES, OPERATORS, REPRESSORS
Member of: (CLASSES in KB GENERICUNITS)

MemberSlot: COFACTOR from MOLECULAR.SWITCHES
    Inheritance: OVERRIDE.VALUES
    Values: Unknown

MemberSlot: COFACTOR.POLARITY from MOLECULAR.SWITCHES
    Inheritance: OVERRIDE.VALUES
    ValueClass: (ONE.OF ACTIVATE DEACTIVATE)
    Values: Unknown

MemberSlot: COFACTOR.SITES from MOLECULAR.SWITCHES
    Inheritance: OVERRIDE.VALUES
    Values: Unknown

MemberSlot: SWITCH.STATE from MOLECULAR.SWITCHES
    Inheritance: OVERRIDE.VALUES
    ValueClass: (ONE.OF ON OFF)
    Cardinality.Min: 1
    Cardinality.Max: 1
    Values: Unknown
This display of the MOLECULAR.SWITCHES unit first provides some information about the unit itself: its creator, last modifier, parent (superclass) in the hierarchy, and direct children (subclasses). Then a few of the slots, which provide knowledge about the concept of molecular switches, are shown. This distinction between knowledge about a unit (sometimes called bookkeeping knowledge) and knowledge about a concept or structure represented by the unit is sometimes confusing, but always important. Each slot has a name, the name of the unit in the hierarchy in which the slot is either defined or last changed, an inheritance mode, and a value. In addition, a slot may have various kinds of restrictions on its possible values, for example, a list of possible values, a numerical range, or a specification that a slot must be “filled” with the name of another unit in the knowledge base. All of the slots shown in this figure were originally defined in this unit and have an inheritance mode of OVERRIDE.VALUES (this means that children of this unit will inherit any values defined here unless they choose to provide specific, more narrowly specified values that still fit the slot’s restrictions). The COFACTOR.POLARITY slot has been given the restriction that a legal value must be either ACTIVATE or DEACTIVATE (or both, since in some rare cases a molecular switch can be activated or deactivated by the same cofactor). The SWITCH.STATE slot is restricted to precisely one value, ON or OFF (that is the purpose of the “cardinality” specification).
FIGURE 14. A Portion of the MOLECULAR.SWITCHES Unit
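The slot machinery described in this caption can be approximated in a few lines. The sketch below is not KEE: the classes `SlotSpec` and `Unit`, the methods `define_slot`, `put`, and `get`, the `LENGTH` slot, and the default value OFF are all assumptions made for illustration. It shows multiple parents, a list-of-values restriction, a maximum cardinality, and override-style inheritance in which a child uses its parent's value until it supplies its own.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SlotSpec:
    allowed: Optional[frozenset] = None  # ONE.OF restriction; None = unrestricted
    max_card: Optional[int] = None       # Cardinality.Max; None = unbounded


class Unit:
    def __init__(self, name, parents=()):
        self.name = name
        self.parents = tuple(parents)
        self.specs = {}   # slot name -> SlotSpec, for slots defined in this unit
        self.values = {}  # slot name -> tuple of values set in this unit

    def lineage(self):
        """Yield this unit, then its ancestors breadth-first, without repeats."""
        seen, queue = set(), [self]
        while queue:
            unit = queue.pop(0)
            if unit.name not in seen:
                seen.add(unit.name)
                yield unit
                queue.extend(unit.parents)

    def define_slot(self, slot, allowed=None, max_card=None):
        allowed = frozenset(allowed) if allowed is not None else None
        self.specs[slot] = SlotSpec(allowed, max_card)

    def spec(self, slot):
        for unit in self.lineage():
            if slot in unit.specs:
                return unit.specs[slot]
        raise KeyError(f"{self.name} has no slot named {slot}")

    def put(self, slot, *values):
        """Set local values after checking the slot's restrictions."""
        spec = self.spec(slot)
        if spec.allowed is not None and not set(values) <= spec.allowed:
            raise ValueError(f"{values} violates the value class of {slot}")
        if spec.max_card is not None and len(values) > spec.max_card:
            raise ValueError(f"{slot} accepts at most {spec.max_card} value(s)")
        self.values[slot] = values

    def get(self, slot):
        """Nearest value along the lineage; None plays the role of 'Unknown'."""
        self.spec(slot)  # raises KeyError if the slot is not visible here
        for unit in self.lineage():
            if slot in unit.values:
                return unit.values[slot]
        return None


switches = Unit("MOLECULAR.SWITCHES")
switches.define_slot("COFACTOR.POLARITY", allowed={"ACTIVATE", "DEACTIVATE"})
switches.define_slot("SWITCH.STATE", allowed={"ON", "OFF"}, max_card=1)
switches.put("SWITCH.STATE", "OFF")         # hypothetical default value

dna_segments = Unit("DNA.SEGMENTS")
dna_segments.define_slot("LENGTH")          # hypothetical slot

operators = Unit("OPERATORS", parents=(dna_segments, switches))
inherited = operators.get("SWITCH.STATE")   # value comes from MOLECULAR.SWITCHES
operators.put("SWITCH.STATE", "ON")         # child overrides the inherited value
```

The minimum-cardinality check shown in the figure is left out here so that a slot can remain Unknown, as it does in the display; a fuller version would presumably enforce it only once a value is supplied.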
Unit: REPRESSORS in knowledge base TRP-SIM.1
Created by [illegible] on 22-Apr-85 15:49:[illegible]
Modified by KARP on [illegible]-May-85 14:18:58
Superclasses: MOLECULAR.SWITCHES, PROTEINS
Subclasses: TRP-R
Member of: (CLASSES in KB GENERICUNITS)

MemberSlot: ACTIVE.SITES from REPRESSORS
    Inheritance: OVERRIDE.VALUES
    Values: Unknown

MemberSlot: COFACTOR from MOLECULAR.SWITCHES
    Inheritance: OVERRIDE.VALUES
    Values: Unknown

MemberSlot: COFACTOR.POLARITY from REPRESSORS
    Inheritance: OVERRIDE.VALUES
    ValueClass: (ONE.OF ACTIVATE DEACTIVATE)
    Values: ACTIVATE

MemberSlot: COFACTOR.SITES from MOLECULAR.SWITCHES
    Inheritance: OVERRIDE.VALUES
    Values: Unknown

MemberSlot: COMPONENT.POLYPEPTIDE.RATIOS from PROTEINS
    Inheritance: OVERRIDE.VALUES
    ValueClass: LIST
    Values: Unknown

MemberSlot: COMPONENT.POLYPEPTIDES from PROTEINS
    Inheritance: OVERRIDE.VALUES
    Values: Unknown

MemberSlot: COMPONENTS from PROTEINS
    Inheritance: OVERRIDE.VALUES
    ValueClass: (ONE.OF TRP [illegible])
    Values: Unknown

MemberSlot: CONCENTRATION from PHYSICAL.ENTITIES
    Inheritance: OVERRIDE
    ValueClass: (ONE.OF HIGH LOW)
    Cardinality.Min: 1
    Cardinality.Max: 1
    Values: Unknown

MemberSlot: CONSTRUCTION.METHOD from PHYSICAL.ENTITIES
    Inheritance: OVERRIDE
    ValueClass: (ONE.OF AGGREGATE PRIMITIVE POLYMER)
    Cardinality.Min: 1
    Cardinality.Max: 1
    Comment: [illegible]
    Values: Unknown

MemberSlot: MOLECULAR.WEIGHT from PHYSICAL.ENTITIES
    Inheritance: OVERRIDE
    ValueClass: INTEGER
    Cardinality.Min: 1
    Cardinality.Max: 1
    Values: Unknown
FIGURE 15. The REPRESSORS Unit
Here we see that REPRESSORS has inherited slots from MOLECULAR.SWITCHES, PROTEINS, and PHYSICAL.ENTITIES, has explicitly stated that the value of the COFACTOR.POLARITY slot first defined above in the MOLECULAR.SWITCHES unit is ACTIVATE, and has defined a new slot called ACTIVE.SITES. So, when the trp-operon discovery system wishes to consider repressors as types of proteins, it need only worry about properties like polypeptide ratios; when it is reasoning about repressors as molecular switches for regulation, it is concerned with properties like cofactors and switch states.

• An observation currently unpredictable by the current model because the model is either not detailed enough or is incomplete. The observation in this case must correlate with the model because an important object of the model is involved or it relates to an effect predicted by the model.
3. Build a mechanism for postulating extensions or corrections to the current theory: a constrained regulatory theory generator. The overall approach to this mechanism is perhaps the most interesting problem in our work. In discussions with other computer scientists, two different notions of reasoning have evolved: “or” reasoning, where the theory construction process consists of hierarchical refinement of abstract ideas into more detailed ones, and “and” reasoning, where the theory is built up in little pieces at many different levels simultaneously. We see strong evidence for both types of reasoning within Yanofsky’s project. In fact, as stated above, the global model of Yanofsky’s laboratory is a hybrid one. Individual graduate students performed “and” tasks, filling in details of seemingly unrelated pieces of the model. Yanofsky was the master “or” reasoner, slowly building a hierarchical model of the new regulatory mechanism.
4. Build a mechanism for evaluating alternative theories. This would include rating the theories based on plausibility, selectability, completeness, significance, and so on. We hope that the evaluation process produces information that will be useful in discriminating among the possible theories.
5. Test the entire structure on the evolving trp-operon regulatory system.
6. Experiment with different initial knowledge bases to see how the discovery process is altered by the availability of new techniques, analogous systems, etc.

Potential Utility and Generality of the MOLGEN Discovery System
As the preceding example shows, our view of the MOLGEN discovery system is as an “intelligence amplifier” for the biological scientist, designed not to replace human reasoning, but to augment it. We believe that such a system can assist the scientist in at least the following ways:
• Thoroughness. It is very difficult for a biologist to have any real confidence that all major ramifications of a proposed addition or modification to a scientific theory have been explored.
• Lack of prejudice. Scientists sometimes remain in a particular theoretical framework of reasoning for too long. A computational system, although maintaining its own “prejudices” in the form of its driving heuristics, can at least ensure that many different viewpoints are represented.
• Coordination of reasoning. It is difficult for humans to focus simultaneously on two or more different subproblems within a single, more complex problem. In the trp-operon example, this was seen in the need to coordinate the role of the ribosome binding site in the leader region with the secondary structures at the end of the leader region. A compound problem-solving architecture like a blackboard system helps to ameliorate this difficulty.
It is clear that much research needs to be done before we can claim to have even a rudimentary scientific theory discovery system within the narrow domain of regulatory genetics. However, we believe the framework of qualitative simulation combined with reasoning about the structural and functional knowledge used to model the system being simulated is generalizable to a wide variety of domains. Once the underlying theories and models of a domain are represented as a knowledge base of the form described in this article, the same reasoning methods we use in regulatory genetics should be applicable.

Acknowledgments. We wish to thank our many colleagues who have contributed to the work described in this article. Particular thanks to Peter Karp for providing the scenario of a discovery program and to IntelliGenetics for providing illustrations of the SEQ program.
REPRESENTATIVE PROBLEM SCENARIO
Since our work is still in the early stages, the following scenario is clearly speculative. However, it illustrates (in a translated-to-English form) the kinds of reasoning we believe a theory-formation system should be able to perform. We have also attempted to indicate the kinds of knowledge used to reason at each stage. (Human input is indicated by the “<” symbol, system output by “>,” and comments are enclosed in square brackets.)

< Consider a normal E. coli growing in a trp-poor medium.
> Understood: Trp operon is activated. Synthesis of biosynthetic enzymes occurs to promote growth.
[The system uses its description of a functional trp operon to initialize and execute a simulation of the described experiment. It then displays a summary of the results of that simulation.]

< Correct. Now consider a similar situation, but with a mutation in the promoter region of the trp operon, and no growth occurs.
> Understood: Transcription is disabled, preventing cell growth because of trp starvation.
[The system understands the technique we have mentioned, knowing it can produce a certain range of effects, e.g., lack of functionality in a given region of DNA. The system propagates this effect through its simulation and again summarizes the result.]

< How would you explain the above cells growing normally?
> We can assume that the mutation you mention has no deleterious effect.
[Above, the system realized the mutation technique could produce several different types of mutants, but only told us about the one it presumed we would be most interested in. Here it is asked to account for behavior different from that which it predicted, and can do so by examining these other mutants.]

< Now consider a mutation somewhere in the leader region that causes constitutive synthesis. How would you explain that?
[Constitutive synthesis means constant or unregulated production of the end product of the genome.]
> The repression mechanism has been disabled in some way. Most likely it is that the operator no longer allows repressor binding.
[Above we discussed a form of simulation-based forward reasoning that the system might use. Here we note that backward reasoning may be appropriate as well. The system starts with a general principle (the disabling of repression), attempts to figure out what effects might have caused this, and correlates these effects with elements of the experiment description.]

< Now we observe the following: A mutation is introduced in the leader region. Trp enzyme production is still regulated by trp in the medium; however, the trp-production rate in a trp-poor environment is significantly higher than normal. These mutants fail to recombine with trp-E.
> I do not understand this. The mutation seems to map into the region between the operator and the first structural gene. This region has no known regulatory function, and hence a mutation here should not affect trp production, although it is possible that this mutation affects transcription or translation speed. Thus I postulate this piece of DNA has some function, possibly a positive effect that the mutation enhances, but probably a negative effect that we have disabled.
[We are now into a discovery mode of proposing explanations to a phenomenon that requires some changes in our model of the regulation of trp. The system knows that a mutation could either have deleted a functional site of which it was currently unaware, or added some new functionality. Its biological heuristics tell it the former is considerably more likely.]

< What types of negative effects might we be disabling?
> Perhaps there was translation or transcription termination or pausing at the mutated site; or perhaps this site changed the elongation rate of polymerase or ribosomes; or perhaps it produces a repressor protein; or perhaps the mutations introduced a promoter into the leader region, although this is not likely because the promoter would have to be regulated by trp.
[These are speculations on additional structures and functions in the leader region of the molecule. They come from a constrained theory generator. The obvious next step is to rank these theories in order of plausibility and then suggest experiments to differentiate among the possibilities.]

FURTHER READINGS
James Watson’s classic book, The Molecular Biology of the Gene from W. A. Benjamin (New York, 1970) is a comprehensive overview of molecular genetics. The most comprehensive descriptions of computer programs for DNA sequence analysis may be found in special issues of the journal Nucleic Acids Research published in January 1982 (Volume 10, Number 1) and 1984 (Volume 12, Number 1) and available as collections entitled The Applications of Computers to Research on Nucleic Acids from IRL Press (Oxford, England). An excellent review of the Yanofsky work on attenuation appears in “Attenuation in the Control of Expression of Bacterial Operons,” Nature, Volume 289, February 26, 1981, pages 751-758. The original MOLGEN project is described in “The Concept and Implementation of Skeletal Plans” by Peter Friedland and Yumi Iwasaki in the Journal of Automated Reasoning, Volume 1, Number 2, 1985, pages 161-208. A good text on methods for machine learning and discovery is Machine Learning: An Artificial Intelligence Approach by Michalski, Carbonell, and Mitchell from Tioga Publishing Company, Palo Alto, California. Three articles that discuss the basic methods for knowledge representation (including a detailed description of KEE) may be found in the September 1985 issue of CACM in the section entitled “Special Section on Architectures for Knowledge-Based Systems” (Volume 28, Number 9, pages 902-941).

CR Categories and Subject Descriptors: I.2.1 [Artificial Intelligence]: Applications and Expert Systems--medicine and science; I.2.4 [Artificial Intelligence]: Knowledge Representation Formalisms and Methods; I.2.6 [Artificial Intelligence]: Learning; I.2.8 [Artificial Intelligence]: Problem Solving, Control Methods and Search; I.6 [Simulation and Modeling]; J.3 [Life and Medical Sciences]: biology
General Terms: Design, Theory
Additional Key Words and Phrases: AI applications to molecular biology, discovery, theory formation
Authors’ Present Addresses: Peter Friedland, Dept. of Computer Science, Knowledge Systems Laboratory, Stanford University, 701 Welch Road, Palo Alto, CA 94304; Laurence H. Kedes, Dept. of Medicine, Veterans Administration Medical Center, Stanford University, 3801 Miranda Avenue, Palo Alto, CA 94304.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.