Knowledge Base Management for Experiment Planning in ... - CiteSeerX

Knowledge Base Management f o r Experiment Planning i n Molecular Genetics Nancy M a r t i n Computing and I n f o r m a t i o n Science U n i v e r s i t y of New Mexico Albuquerque, New Mexico Peter F r i e d l a n d , Jonathan K i n g . Mark S t e f i k Computer Science Department Stanford U n i v e r s i t y Stanford, California

The use of a r e p r e s e n t a t i o n language i n v o l v i n g schemata and associated d e r i v e d models has been extended to i n c l u d e a l l aspects of domain knowledge and s t r a t e g y and h e u r i s t i c problem s o l v i n g knowledge. This uniform r e p r e s e n t a t i o n w i l l allow the extension of knowledge base management techniques for acquisition and r e t r i e v a l o f procedural knowledge. D e s c r i p t i v e Terms: genetics, h e u r i s t i c problem solving, knowledge bases, MOLGEN, planning systems, r e p r e s e n t a t i o n of knowledge, rule-based systems. 1

Introduction

The design of MOLGEN is based on the proposition that s u c c e s s f u l problem s o l v i n g i n complicated domains r e q u i r e s the use of l a r g e amounts of domain s p e c i f i c knowledge. As a r e s u l t , the r e p r e s e n t a t i o n of t h i s knowledge is a problem of p r a c t i c a l importance. The domain s p e c i f i c knowledge i n c l u d e s knowledge about the o b j e c t s , the t r a n s f o r m a t i o n s a p p l i c a b l e to the o b j e c t s and i n f o r m a t i o n about use, e f f e c t i v e n e s s , and cost of the t r a n s f o r m a t i o n s . The r e p r e s e n t a t i o n should f a c i l i t a t e the a c q u i s i t i o n o f the knowledge from the user and the r e t r i e v a l of r e l e v a n t knowledge d u r i n g problem s o l v i n g . An i n n o v a t i v e f e a t u r e of MOLGEN w i l l be i t s a b i l i t y t o represent a l l problem s o l v i n g knowledge in a u n i f o r m manner. In a d d i t i o n , the system design i n s u r e s t h a t : 1 . Planning s t r a t e g i e s and heuristics w i l l not be b u i l t i n t o the code of the system. 2. Procedures w i l l be expressed in such a way t h a t the c o n t e n t s of the named component p a r t s of the procedures can be r e f e r e n c e d . The paper f i r s t g i v e s a b r i e f d e s c r i p t i o n of domain o b j e c t s , t r a n s f o r m a t i o n s , and the design of e x p e r i m e n t a l p l a n s . Then Section 3 describes the r e p r e s e n t a t i o n of knowledge and the techniques being developed to manage the knowledge base. Finally the c u r r e n t s t a t e o f implementation i s discussed. 2

Experimental Tasks

MOLGEN w i l l design both s y n t h e s i s and analysis experiments. Synthesis experiments b u i l d a s p e c i f i e d DNA s t r u c t u r e from completely or partially specified starting materials. Analysis experiments discriminate among competing s t r u c t u r a l hypotheses. An example is g i v e n in Section 2 . 3 . * This research was supported by the N a t i o n a l Science Foundation under Grants NSF MCS76-11935 and MCS76-116M9. J. King is supported by an IBM graduate research f e l l o w s h i p .

In t h i s work we are assuming t h a t the competing hypotheses are given to the system and the task is to d i s c r i m i n a t e among them. See [Feitelson77] for a description of knowledge sources f o r the systematic f o r m a t i o n and t e s t i n g of hypotheses in a recent g e n e t i c s experiment. The design of molecular g e n e t i c s experiments may be viewed w i t h i n a f a m i l i a r AI paradigm: s t a t e - s p a c e search. World s t a t e s f o r these searches c o n s i s t of sets of DNA s t r u c t u r e s along w i t h d e s c r i p t i o n s of an experimenter s knowledge about them. State t r a n s f o r m a t i o n s correspond to l a b o r a t o r y techniques w h i c h , a l t e r s t r u c t u r e s o r increase an experimenter's knowledge about s t r u c t u r e s . An a c t u a l experiment as c a r r i e d out in the l a b o r a t o r y can be represented by an i n i t i a l world s t a t e and a sequence of t r a n s f o r m a t i o n s . A l a b o r a t o r y problem can be represented by an i n i t i a l world s t a t e and d e s i r e d f i n a l world s l a t e . However, an experimental plan could be more complicated than the t r a c e of a particular experiment as it was performed in the l a b o r a t o r y . The plan must express the f a c t t h a t at a g i v e n p o i n t in an experiment the next t r a n s f o r m a t i o n may depend on the outcome of the present: transformation. It also must allow t r a n s f o r m a t i o n s to remain unordered when the order of a p p l i c a t i o n is not i m p o r t a n t . Thus, an experimental plan is a more complicated s t r u c t u r e than a simple sequence of t r a n s f o r m a t i o n s . 2.1

Genetics Knowledge

2.1.1

Domain Objects

The science of molecular g e n e t i c s is centered around the s t r u c t u r e and f u n c t i o n of the deoxyribonucleic acid (DNA) molecule. DNA c o n s i s t s of two p a r a l l e l s t r a n d s composed of chemical groups called "nucleotides". All n u c l e o t i d e s are composed of a sugar p a r t which forms the DNA "backbone" and a base which p a r t i c i p a t e s in "hydrogen bonds" between the p a r a l l e l strands. Only f o u r bases commonly e x i s t in the DNA molecule, Adenine, Guanine, C y t o s i n e , and Thymine — abbreviated as A, G, C, and T. Furthermore. A and T w i l l o n l y form hydrogen bonds w i t h each o t h e r ; l i k e w i s e f o r C and G. DNA s t r u c t u r e s have b i o l o g i c a l , t o p o l o g i c a l and s e q u e n t i a l a t t r i b u t e s . The molecule can be viewed as a chromosome whose component p a r t s are enes having s p e c i f i c b i o l o g i c a l and c o n t r o l unctions. I n a d d i t i o n t o the double stranded topology described above, DNA has many p o s s i b l e c o n f i g u r a t i o n s . Hydrogen bonds and backbone bonds can be broken r e s u l t i n g in s t r u c t u r a l anomalies such as n i c k s ( s i n g l e backbone bond b r o k e n ) , gaps ( s e v e r a l bases missing along one s t r a n d ) , bubbles (a sequence of i n t e r n a l hydrogen bonds broken so the s t r a n d s s e p a r a t e ) . The f i n a l a t t r i b u t e o f the molecule is the a c t u a l sequence of bases used to form the i n d i v i d u a l s t r a n d s . I n a d d i t i o n t o the f o u r bases described above, t h e r e are a l t e r n a t e bases which can occur.

ApnlIcat tons-2: 882

Mart In

I n any p a r t i c u l a r experiment, a t t r i b u t e s of each type may be i m p o r t a n t * The g e n e t i c i s t o f t e n does not know enough about the s t r u c t u r e of a molecule to know p r e c i s e l y where a f e a t u r e , such as a gene or a s p e c i f i c base sequence, is on the molecule. 2.1.2

knowledge contexts, c o s t s or evaluating laboratory 2*3

Domain Transformations

Modification transformations consist of making and breaking bonds and separating molecules. These a c t i o n s can be c a r r i e d out by a v a r i e t y of p h y s i c a l , chemical and b i o l o g i c a l techniques. A commonly used group of n a t u r a l l y o c c u r i n g reagents are c a l l e d enzymes. These are catalyze reactions that make or break used t o bonds

Hypothesis 1: The sample c o n t a i n s molecules w i t h the sequence AAAAAAA...(called poly A ) , where the l e n g t h of the poly A sequence is at l e a s t t h i r t y bases. Hypothesis molecules.

2: The

The usual focus the poly A sequence.

The s p e c i f i c i t i e s of the t r a n s f o r m a t i o n s vary over a wide range. For example, each r e s t r i c t i o n enzyme recognizes a s p e c i f i c base p a t t e r n and c u t s the molecule at t h a t s p e c i f i c site* In c o n t r a s t , c e r t a i n exonuclease enzymes cut the molecule from one end l i b e r a t i n g fragments of unpredictable size.

sample c o n t a i n s

in such a problem

no such would be

Planl: One a t t r i b u t e of a long poly A sequence is that it will have weaker hydrogen bonds than other r e g i o n s . This knowledge is e a s i l y deduced from the f a c t hat A ' s form c o n s i d e r a b l y weaker hydrogen bonds w i t h T ' s than do C s w i t h G's. E x p l o i t i n g t h i s weakness leads t o the idea o f p a r t i a l l y d e n a t u r i n g ( b r e a k i n g the weaker hydrogen bonds i n ) the molecules and l o o k i n g f o r the r e s u l t i n g bubble shape under an e l e c t r o n microscope.

Unfortunately, there is not a set of p r i m i t i v e t r a n s f o r m a t i o n s t h a t w i l l break o r make any a r b i t r a r y s p e c i f i e d bond on a m o l e c u l e . A p a r t of a molecule t h a t is a conceptual e n t i t y to a g e n e t i c i s t , such as a gene or a r e g i o n w i t h a p a r t i c u l a r base sequence, may not be recognizable by any m o d i f i c a t i o n t r a n s f o r m a t i o n . Therefore, f o r each s i t u a t i o n the g e n e t i c i s t must f i n d some combination of transformations t h a t approximn'es the t r a n s f o r m a t i o n d e s i r e d .

Plan2: Recognizing t h a t an a p p r o p r i a t e complementary base sequence would bind s e l e c t i v e l y to an embedded poly A r e g i o n leads to a " p r o b e " plan, A probe of r a d i o a c t i v e l y l a b e l l e d poly T ( t h e complementary sequence) is o b t a i n e d . The probe is mixed w i t h the DNA sample. F i n a l l y , an observation technique is used to see i f r a d i o a c t i v i t y has been i n c o r p o r a t e d i n t o the DNA molecules.

A given m o d i f i c a t i o n t r a n s f o r m a t i o n may produce many products in a s i n g l e a p p l i c a t i o n and may not always carry out an i d e n t i c a l l y reproducible a c t i o n . An example is the enzyme Ligase which causes pieces of double stranded L)NA to j o i n forming longer p i e c e s . The enzyme w i l l not l u s t l o i n the two d e s i r e d segments. The f i n a l sample w i l l c o n s i s t o f segments o f v a r y i n g l e n g t h and varying composition. If a particular composition is d e s i r e d , it must be recognized and separated from the sample by another transformation.

Plan3: A f i n a l p e r s p e c t i v e makes use of the f a c t t h a t the b i o l o g i c a l f u n c t i o n of DNA is to create p r o t e i n s . In the case of Poly A, the p r o t e i n produced would be poly-valine. The e x p e r i m e n t a l plan i s t o c r e a t e p r o t e i n s from the DNA sample and t e s t f o r the presence of p o l y valine.

Observational transformations are p a r t i c u l a r l y important t o the g e n e t i c i s t . They are used t o v e r i f y o r p r e d i c t the r e s u l t s o f modifications. Common observational transformations include: analytic electrophoresis (a l a b o r a t o r y technique which d i s c r i m i n a t e s among DNA molecules based p r i m a r i l y upon t h e i r molecular weight and electric charge), ultraviolet fluorescence (exposing a sample of molecules to u l t r a v i o l e t r a d i a t i o n t o determine i f a structure is s i n g l e or double s t r a n d e d ) . and e l e c t r o n microscopy (,to observe t o p o l o g i c a l d e t a i l s ) . Genetic

A n a l y s i s Example

One method of a n a l y s i s p l a n n i n g c o n s i s t s of f o c u s s i n g i n i t i a l a t t e n t i o n upon some f e a t u r e of the problem and c r e a t i n g a model which emphasizes the a t t r i b u t e s o f t h a t f e a t u r e . I n t h i s example we i l l u s t r a t e how d i f f e r e n t a t t r i b u t e s of DNA can lead to d i f f e r e n t plans f o r the same experimental task. The problem is to d i s c r i m i n a t e between the two hypotheses:

There are two types of domain t r a n s f o r m a t i o n s : m o d i f i c a t i o n and o b s e r v a t i o n . By m o d i f i c a t i o n we mean a t r a n s f o r m a t i o n which changes DNA s t r u c t u r e s . By o b s e r v a t i o n we mean a transformation which changes the s t a t e of i n f o r m a t i o n a v a i l a b l e t o the experimenter about those s t r u c t u r e s .

2.2

such as p l a n sketches for various design cost h e u r i s t i c s which p r e d i c t the transformations, and heuristics for the relevance and s p e c i f i c i t y of t r a n s f o r m a t i o n s t o the c u r r e n t problem.

3

Representation of Knowledge

We can summarize the knowledge which MOLGEN w i l l use i n experiment p l a n n i n g a s f o l l o w s : Static

Knowledge—

Knowledge which is f i x e d d u r i n g problem s o l v i n g .

(1) O b j e c t : Conceptual e n t i t i e s o f the system l i k e DNA molecules, enzymes, samples, l a b t e c h n i q u e s . (2) A c t i o n : World State transformations corresponding t o the effects of laboratory techniques.

Abstractions

The a c t u a l l a b o r a t o r y t r a n s f o r m a t i o n s are o f t e n n o n s p e c i f i c both i n terms o f the o b j e c t s they act upon and the r e s u l t s they produce. This c h a r a c t e r i z a t i o n o f the domain t r a n s f o r m a t i o n s leads n a t u r a l l y to the use of a b s t r a c t o b j e c t s and t r a n s f o r m a t i o n s in p l a n n i n g experiments. The a b s t r a c t i o n s correspond t o the conceptual e n t i t i e s and t r a n s f o r m a t i o n s of the g e n e t i c i s t and w i l l be an i m p o r t a n t p a r t of the knowledge base.

(3) S t r a t e g y / C o n t r o l : Genetic strategies and general solving heuristics. Dynamic

Knowledge—

planning problem

Knowledge which can change d u r i n g problem s o l v i n g .

(4) World S t a t e : A description of the simulated genetics experimental environment and measurements of i t .

S i m i l a r l y , the p l a n n i n g s t r a t e g i e s used by the g e n e t i c i s t i n c r e a t i n g a p l a n w i t h these a b s t r a c t i o n s are i m p o r t a n t h e u r i s t i c s f o r the system. This encompasses a broad range of

(5) Planning S t a t e : A r e p r e s e n t a t i o n of the p a r t i a l l y designed experiment and the a c t i v i t y o f the problem s o l v i n g process.

Applictions-2:

Mart In 883

A similar classification of knowledge a p p l i e s f o r many problem s o l v i n g systems. Besides r e p r e s e n t i n g o b j e c t s and a c t i o n s , a l l systems must represent problem s o l v i n g s t r a t e g i e s and s t a t e s . These are o f t e n embedded in the program or c o n t r o l structure r a t h e r than being e x p l i c i t entries in the knowledge bases. However, embedding knowledge i n t h it sh w in i sa y way may s e r i o u s l y impair the system s Changing, adding, or d e l e t i n g an fflexibillity. lexibility i n s t a n c e of any type of problem s o l v i n g knowledge can have wide ranging e f f e c t s on knowledge of a l l types throughout the system. In a l a r g e knowledge base, i t is v i t a l f o r the system t o a s s i s t the user i n l o c a t i n g these e f f e c t s . I n a d d i t i o n , any system which w i l l be extended to work on new problems must have f a c i l i t i e s f o r adding new i n s t a n c e s of each category of knowledge.

3.1

Cprpmon R e p r e s e n t a t i o n Solving Knowledge

for All

Problem

To represent these types of knowledge, we adopt the use of a s t r u c t u r e d , o b j e c t - c e n t e r e d knowledge base. This idea has r e c e n t l y been advanced f o r problem s o l v i n g in KRL [Bobrow77] and f o r knowledge a c q u i s i t i o n by TEIRESIAS [ D a v i s 7 6 ] . b u i l d i n g on the i d e a s , among o t h e r s , of SIMULA classes [ D a h l 7 2 ] , FRAMES [ M i n s k y 7 4 ] , and SMALLTALK classes [Goldberg76]. The main components of these r e p r e s e n t a t i o n s , v a r i o u s l y c a l l e d " f r a m e s " , "units", or "schemata", can be thought of (following [Davis76]) as record d e f i n i t i o n s extended to support i n t e r r e l a t i o n s h i p s and to accommodate attached procedures. We w i l l r e f e r to them as "schemata" , Each schema w i l l be composed of slots with associated value or type specifications and attached procedures. All schemata will be organized in a g e n e r a l i z a t i o n / s p e c i a l i z a t i o n hierarchy s i m i l a r to t h a t of o t h e r systems. Associated w i t h each schema w i l l be a model which summarizes the ways i n which the v a r i o u s s l o t s have been f i l l e d . These schemata and t h e i r associated models provide the knowledge needed by management r o u t i n e s f o r a c q u i r i n g and updating the knowledge base and f o r many problem s o l v i n g t a s k s . The issues i n v o l v e d in r e p r e s e n t i n g knowledge in such a schema system are s i m i l a r to the issues when the r e p r e s e n t a t i o n is a semantic n e t . See [Woods75] and [Brachman76] f o r d i s c u s s i o n s o f these i s s u e s . A novel aspect of the design of the MOLGEN knowledge base i s the r e p r e s e n t a t i o n o f a l l types of problem s o l v i n g knowledge in a common formalism — as i n s t a n c e s of schemata. Procedural knowledge w i l l be represented in such a way t h a t the system can e a s i l y i n s p e c t any procedure. The schema system provides a mechanism for breaking procedures i n t o component p a r t s which can be addressed s e p a r a t e l y , and are thus a c c e s s i b l e to the system (see s e c t i o n 3 . 2 ) . This w i l l a i d a c q u i s i t i o n and use d u r i n g problem s o l v i n g . Schemata f o r procedural knowledge are discussed more f u l l y below. The i n f o r m a t i o n g a t h e r i n g process at the beginning o f a problem s o l v i n g session is s i m i l a r to the process of a c q u i r i n g new instances of domain o b j e c t s and t r a n s f o r m a t i o n s . By using a uniform r e p r e s e n t a t i o n , the g a t h e r i n g o f t h i s i n i t i a l i n f o r m a t i o n from the user w i l l b e handled by the same mechanisms which acquire g e n e t i c knowledge f o r the permanent knowledge base. 3.1.1

The

Schema Hierarchy

and the Creation of

The g e n e t i c i s t c l a s s i f i e s domain o b j e c t s and processes a c c o r d i n g to t h e i r common p r o p e r t i e s . For example, a l l p h y s i c a l measurements are grouped t o g e t h e r ; a l l enzymes are grouped t o g e t h e r ; a l l chemical techniques are grouped t o g e t h e r . The schemata w i l l be organized in a h i e r a r c h y t h a t r e f l e c t s these c l a s s i f i c a t i o n s . Schemata w i l l b e l i n k e d from most general to most s p e c i a l i z e d category, with specializations inheriting designated p r o p e r t i e s ( s l o t s , r e s t r i c t i o n s o n t h e i r

values. attached generalizations.

procedures)

from

their

Every i n d i v i d u a l e n t i t y i n the system w i l l be created as an i n s t a n c e of some schema (known as its " p r o t o t y p e " ) . The instance w i l l inherit its set of s l o t s from i t s p r o t o t y p e schema and from the g e n e r a l i z a t i o n s of the p r o t o t y p e . The values f i l l i n g the s l o t s must a l s o obey the r e s t r i c t i o n s specified in the prototype and its g e n e r a l i z a t i o n s . These values w i l l be i n s t a n c e s of other schemata. For example, we might have a "growth medium" s l o t in the schema f o r " c u l t u r e growth" which w i l l be f i l l e d by an i n s t a n c e of the schema " g e l " ; the " r e s o l v i n g power" s l o t of the schema for "electrophoresis" (an a n a l y t i c a l technique) might be f i l l e d by an i n s t a n c e of the schema f o r " r e a l number". 3.1.2

Grouping E n t i t i e s bv Function

The g e n e t i c i s t ' s c l a s s i f i c a t i o n h i e r a r c h y does not always r e f l e c t the f u n c t i o n s t h a t domain o b j e c t s have in common. For example, r e s t r i c t i o n enzymes and l i g a s e s are both classes of enzymes, but the former performs c u t t i n g a c t i o n s on DNA, w h i l e the l a t t e r performs s e a l i n g a c t i o n s . From the p o i n t of view of seeking a t o o l to perform c u t t i n g d u r i n g an experiment, r e s t r i c t i o n enzymes are more c l o s e l y r e l a t e d to some p h y s i c a l methods than to enzymes such as ligase. Functional r e l a t i o n s h i p s w i l l b e represented s e p a r a t e l y from the classification hierarchy. The schema h i e r a r c h y i s not meant to indicate a l l possible r e l a t i o n s among e n t i t i e s o r a l l p o s s i b l e ways i n which something can be viewed. 3.1.3

Derived Models Schemata

of

Typical

Instances

of

The TEIRESIAS program [Davis76] maintained tabulations of the individual and joint occurrences o f s p e c i f i c tokens w i t h i n the premises and a c t i o n s of each r u l e t y p e . These t a b u l a t i o n s were c a l l e d " r u l e models". They were used to c r e a t e e x p e c t a t i o n s about the c o n t e n t s of new r u l e s . Models are d e r i v e d from a c t u a l i n s t a n c e s and record the values which have been entered in each s l o t of a p a r t i c u l a r schema. We w i l l use models f o r prompting the user f o r p o s s i b l y m i s s i n g i n f o r m a t i o n or to suggest d e f a u l t v a l u e s . For example, the d e r i v e d model f o r the " i n i t i a l world s t a t e " schema might note t h a t e i g h t y percent of the instances which mention pH of the sample a l s o mention the c o n c e n t r a t i o n of magnesium i o n s . A user who c r e a t e s an i n i t i a l world s t a t e mentioning pH but not magnesium c o n c e n t r a t i o n could be asked whether t h a t should be i n c l u d e d t o o , based on the c o r r e l a t i o n o f occurrences. 3.1,4

Uses

of

the Sehema Hrarchy

A major use of the schema h i e r a r c h y w i l l be t o guide the a c q u i s i t i o n o f a l l types o f domain knowledge. This use of the schema h i e r a r c h y is motivated by i t s use in the T e i r e s i a s system. In the process of e n t e r i n g a new enzyme, a user w i l l be guided through the h i e r a r c h y , from o b j e c t s to enzymes to a p a r t i c u l a r c l a s s of enzymes, u n t i l the proper p r o t o t y p e i s l o c a t e d . I f the necessary p r o t o t y p e does not e x i s t , the system w i l l guide the user i n c r e a t i n g i t and i n s e r t i n g i t c o r r e c t l y i n the h i e r a r c h y . The p r o t o t y p e w i l l then g i v e i n s t r u c t i o n s to the system on how to prompt the user f o r i n f o r m a t i o n about the new i n s t a n c e , i n c l u d i n g documentation. The p r o t o t y p e a l s o w i l l g i v e i n s t r u c t i o n s on how to i n c o r p o r a t e the new Instance i n t o system l i s t s o r t a b l e s a s necessary. A second important use of the schema hierarchy w i l l be for organizing r e t r i e v a l of o b j e c t s matching p a r t i c u l a r d e s c r i p t i o n s d u r i n g problem s o l v i n g . Derived models can play a r o l e I n i n c r e a s i n g the e f f i c i e n c y o f t h i s search process. By comparing the d e s i r e d d e s c r i p t i o n w i t h the s t a t i s t i c a l i n f o r m a t i o n i n the model, a

Applications-2: 884

Martin

measure of the l i k e l i h o o d of f i n d i n g an i n s t a n c e matching the d e s c r i p t i o n can be o b t a i n e d . When a match is u n l i k e l y , the search could be d i r e c t e d to another branch o f the h i e r a r c h y . 3.2

In t h i s example of a domain t r a n s f o r m a t i o n , we w i l l i l l u s t r a t e the k i n d of knowledge contained i n the r u l e , the o r g a n i z a t i o n o f the schema f o r the r u l e and the way t h a t the c r e a t i o n of the r u l e as an i n s t a n c e of a schema l e t s it be r e t r i e v e d in more than one way.

Schemata f o r Procedural Knowledge

A s i m p l i f i e d schema f o r d e t e c t i o n techniques is shown in F i g u r e 1. The bookkeeping s l o t s p r o v i d e a place f o r documentation of the knowledge base. Next are s l o t s which must be f i l l e d when an i n s t a n c e of the schema (a s p e c i f i c d e t e c t i o n r u l e ) is c r e a t e d . Adiacent to the s l o t names are type r e s t r i c t i o n s . The s l o t s name those p a r t s of the r u l e which are determined by the instances of the rule. I n t h i s example, t h e r e are t h r e e such s l o t s : Preparatory-Method whose value is to be an i n s t a n c e of the P r e p a r a t o r y - r u l e schema; M i n Sample-Size whose value is to be an i n s t a n c e of the Mass schema; and F e a t u r e - L i s t whose value is to be a l i s t of i n s t a n c e s of the DNA-Features schema. In a c q u i r i n g an instance of the Detection-Technique schema, o n l y these s l o t s need be f i l l e d . An i n s t a n c e of the Detection-Technique schema is g i v e n in F i g u r e 2. The procedural p a r t of the schema is contained in the form s l o t . The code in Figure 2 in the Form s l o t c o n s i s t s of the Figure 1 code w i t h the a p p r o p r i a t e s u b s t i t u t i o n s made f o r the named components.

One of the novel aspects of the design of the MOLGEN knowledge base is the use of schemata to represent p r o c e d u r a l knowledge. Instances of these schemata w i l l be the " r u l e s " t h a t express t r a n s f o r m a t i o n a l and problem s o l v i n g knowledge. In most s i m i l a r systems, the knowledge about r u l e types is g e n e r a l l y in the program code. These r u l e s range from general QLISP procedures to highly r e s t r i c t e d formats. For example, TEIRESIAS expects all rules to be "IF-THEN" r u l e s . Knowledge about the two named components of the r u l e , the premise and the c o n c l u s i o n , i s i n the c o n t r o l program. Using t h i s knowledge, the system can prompt the user f o r i n f o r m a t i o n t o f i l l these components and can r e t r i e v e r u l e s by r e f e r e n c e to the content of the components. Rather than express r u l e s in a t r a d i t i o n a l p r o d u c t i o n system f o r m a t , we use e x p l i c i t i t e r a t i o n statements, c o n d i t i o n a l statements and assignment s t a t e m e n t s . The g r a n u l a r i t y of the r u l e s depends on the type of rule. Rules expressing modification transformations will be relatively large, r e f l e c t i n g the g e n e t i c process being modeled. Planning r u l e s w i l l vary i n s i z e from small t r a d i t i o n a l r u l e s to more complex r u l e s . Our design extends TEIRESIAS' knowledge management capabilities of acquisition and content r e f e r e n c i n g to handle the complex types of r u l e s . To do t h i s we w i l l use the s l o t mechanism. One s l o t w i l l correspond t o each component o f a r u l e and may be f i l l e d by i n s t a n c e s of o t h e r e n t i t i e s — that i s , oblects or other operations. This design i s used f o r a l l procedures i n the system from predefined system o p e r a t i o n s to the domain r u l e s which w i l l be entered by the user. For example, the components of an i t e r a t i o n p r i m i t i v e (FOREACH) are described by s l o t s corresponding to an ITERATION-LIST and an ITERATION-BODY. S l o t s of domain r u l e s correspond to domain r e l e v a n t components as i l l u s t r a t e d by the schema f o r detection techniques described in the next section. I n both cases s p e c i a l i z e d schemata w i l l guide the f i l l i n g o f the s l o t s .

The process of c r e a t i n g the p r o t o t y p e FORM i s more c o m p l i c a t e d . It involves using the schemata f o r each of the component p a r t s of the rule. 3.2.2 The OBSERVE Operation An important p a r t of the Detection-Technique r u l e is the o p e r a t i o n "OBSERVE". OBSERVE is a f u n c t i o n t h a t makes world s t a t e changes t h a t r e f l e c t an increase in the e x p e r i m e n t e r ' s knowledge at any p o i n t in the experiment. OBSERVE has two basic actions: f i r s t i t must c a l l the a p p r o p r i a t e procedures to make an o b s e r v a t i o n on One world state. Then, it must make a p p r o p r i a t e changes to the world s t a t e t o r e f l e c t t h i s o b s e r v a t i o n . This p cess is guided by the schemata f o r the arguments of the f u n c t i o n . In t h i s case, each DNA f e a t u r e which can be observed w i l l have an associated p a t t e r n matching r u l e which can check f o r the f e a t u r e in a given structure. For example, many of the DNA-Features w i l l be i n s t a n c e s of the schema f o r DNA-Nucleotide-Graph and these schemata w i l l i n h e r i t a p a t t e r n matching r u l e from t h e i r p r o t o t y p e which performs a subgraph matching process. This a s s o c i a t i o n of the p a t t e r n - m a t c h i n g procedure w i t h the p a t t e r n t o be matched is similar in philosophy to c o n s t r u c t i o n s in the SMALLTALK [Goldberg76] system and w i l l help make the OBSERVE f u n c t i o n q u i t e general.

As a r e s u l t of t h i s r e p r e s e n t a t i o n paradigm, s t r a t e g i e s can be w r i t t e n to r e t r i e v e r u l e s by r e f e r e n c e to the content of named components. This w i l l i n s u r e t h a t r u l e s are not i n c o r r e c t l y r e t r i e v e d because o f a f a u l t y e x t e r n a l d e s c r i p t o r . A l s o , i t w i l l b e p o s s i b l e t o r e t r i e v e a r u l e using a v a r i e t y of p a t t e r n s , r a t h e r than by lust a s i n g l e name. Content r e f e r e n c i n g is m o t i v a t e d by a need to a s s o c i a t e a new r u l e w i t h a p p r o p r i a t e problem s o l v i n g c o n t e x t s w i t h o u t s p e c i f y i n g those c o n t e x t s a t the time the r u l e i s a c q u i r e d .

3.2.3

Rule r e t r i e v a l by means of content matching may be very i n e f f i c i e n t in a l a r g e system. In some i n s t a n c e s we w i l l precompute r e t r i e v a l by a s s o c i a t i n g w i t h each s t r a t e g y r u l e a " r u l e s e t " c o n t a i n i n g the r u l e s i t most f r e q u e n t l y r e t r i e v e s . Each r u l e s e t w i l l have an attached p a t t e r n which w i l l be checked a g a i n s t a newly entered r u l e to check whether it should be added to the r u l e s e t .

Multiple Access by

Means

of

Content

Decomposing a r u l e i n t o a c c e s s i b l e s l o t s w i l l make i t r e t r i e v a b l e b y content for different purposes. W e w i l l i l l u s t r a t e t h i s w i t h the r u l e f o r Electron-Microscopy ( f i g u r e 2 ) , This r u l e i s an i n s t a n c e of Detection-Technique schema. Like o t h e r types o f d e t e c t i o n t e c h n i q u e s , electron microscopy i s used f o r d e t e c t i n g c e r t a i n f e a t u r e s i n molecules. I t s a b i l i t y t o detect features is l i m i t e d b y v a r i o u s c o n s t r a i n t s , i l l u s t r a t e d i n our s i m p l i f i e d example by the s l o t s Min-Sample-Size and F e a t u r e - L i s t . The Detection-Technique schema ( f i g u r e 1) emphasizes both how i t s i n s t a n c e s are s i m i l a r and how they d i f f e r . Thus, in an experiment where two samples must be d i s t i n g u i s h e d , a s t r a t e g y r u l e might choose a d e t e c t i o n technique which can work w i t h a v e r y s m a l l sample s i z e i f the sample m a t e r i a l is precious. A l t e r n a t i v e l y , i f the sample is large, the d e t e c t i o n technique which can recognize the broadest range of f e a t u r e s might be d e s i r a b l e .

3 . 2 . 1 A Schema for Detection Tecftnjqv^g An i m p o r t a n t c l a s s of a c t i o n s in the g e n e t i c s domain is what we term " d e t e c t i o n techniques". The d e t e c t i o n r u l e schema (see f i g u r e 1) summarizes MOLGEN s r e p r e s e n t a t i o n f o r a c l a s s of l a b o r a t o r y techniques which measure r o p e r t i e s of DNA s t r u c t u r e s . When a MOLGEN r u l e o r a p a r t i c u l a r d e t e c t i o n technique i s executed, i t w i l l t r a n s f o r m a world s t a t e i n s t a n c e b y r e c o r d i n g the r e s u l t o f the o b s e r v a t i o n . For example, a d e t e c t i o n technique might increase the experimenter s knowledge about some of the t o p o l o g i c a l f e a t u r e s or the s t r u c t u r e s such as t h e i r l e n g t h or s u b s t r u c t u r e s . This new knowledge must be added to the world s t a t e .

P

Applications-2:

885

Martin

3.3

Example Schew and. Instance

—BOOKKEEPING SLOTS FILLED AT SCHEMA CREATION— Name: Detection-Technique G e n e r a l i z a t i o n : Laboratory-Technique Description: "Rules f o r d e t e c t i n g s t r u c t u r a l f e a t u r e s of DNA" ---SLOTS TO BE FILLED WHEN CREATING AN INSTANCE— Preparatory-Method: Min-Sample-Size: Feature-List:

Preparatory-Technique Mass LIST OF DNA-Feature

Form: IF WS-SAMPLE.Size > Min-Sample-Size THEN

APPLY Preparatory-Method TO WS-SAMPLE FOREACH S t r u c t u r e IN WS-SAMPLE.Structures FOREACH Feature IN F e a t u r e - L i s t OBSERVE(Feature,Structure) Figure 1 ** Schema f o r D e t e c t i o n Technique **

(Comment on n o t a t i o n : the n o t a t i o n A.B r e f e r s to the contents of s l o t B of schema A.) —BOOKKEEPING SLOTS FILLED AT SCHEMA CREATIONNAME: Election-Microscopy Instance-of: Detection-Technique Description: "Technique f o r observing and measuring s t r u c t u r a l f e a t u r e s . " —SLOTS FILLED WHEN RULE WAS CREATED— Preparatory-Method: Min-Sample-Size: Feature-list:

Protein-Stain 1 microgram (bubble, h a i r p i n - l o o p , circular.single-stranded, double-stranded,length,...)

Form: IF WS-SAMPLE.SIZE > 1 microgram THEN

APPLY P r o t e i n - S t a i n TO WS-SAMPLE FOREACH S t r u c t u r e IN WS-SAMPLE.Structures FOREACH Feature IN (bubble, hairpin-loop,circular,...) OBSERVE(Feature,Structure) Figure 2 **Electron-Microscopy

3.4

Rule**

(1) SPECIAL EDITORS: A c q u i r i n g schema instances from the user w i l l u s u a l l y be guided by a general procedure i n t e r p r e t i n g from the schema what i n f o r m a t i o n the user must p r o v i d e . However, some objects will have f a i r l y complicated s t r u c t u r e s . For example, models of DNA w i l l be graph s t r u c t u r e s w i t h segments represented by nodes and chemical bonds represented by l i n k s . For these o b j e c t s it w i l l be useful to attach s p e c i a l procedures t o t h e i r schemata t o guide a c q u i s i t i o n . We have already b u i l t and t e s t e d such a program f o r a c q u i r i n g DNA s t r u c t u r e s . It is a p i c t u r e - o r i e n t e d e d i t o r and has proven easy f o r g e n e t i c i s t s t o use. (2) RESTRICTION EXPERTS: The schema h i e r a r c h y discussed in a previous s e c t i o n is based on a strict generalization/specialization r e l a t i o n s h i p between schemata. This i s r e f l e c t e d not only i n the i n h e r i t a n c e o f s l o t s but also in the i n h e r i t a n c e o f r e s t r i c t i o n s o n the values f o r slots. We will require that the value s p e c i f i c a t i o n in the s l o t of a s p e c i a l i z a t i o n be more r e s t r i c t i v e than the value s p e c i f i c a t i o n i n the corresponding s l o t of the g e n e r a l i z a t i o n . The q u e s t i o n of whether one value s p e c i f i c a t i o n is more r e s t r i c t i v e than another is d i f f i c u l t to determine in a general way. For example, a l i s t s p e c i f i c a t i o n i s more r e s t r i c t i v e than another i f it is a subset of the l a t t e r ; a r e a l number interval s p e c i f i c a t i o n i s more r e s t r i c t i v e than another i f i t s endpoints are contained w i t h i n the o t h e r ; in some cases a DNA graph s p e c i f i c a t i o n might be considered to be more r e s t r i c t i v e than another graph specification if it c o n t a i n s the other as a subgraph. This v a r i e t y of comparisons between value r e s t r i c t i o n s suggests t h a t the r e s t r i c t i o n comparison should be associated w i t h the schema f o r the data type being r e s t r i c t e d . (3) PRE-COMPUTING ASSOCIATIONS: The MOLGEN knowledge base design has emphasized f l e x i b l e content r e f e r e n c e as a means f o r accessing domain transformation r u l e s . This can be the basis f o r s t r a t e g i c s e l e c t i o n o f domain r u l e s d u r i n g the p l a n n i n g process and is m o t i v a t e d by a need to (1) index new r u l e s to e x i s t i n g s t r a t e g i e s and (2) permit new s t r a t e g i e s t o index e x i s t i n g r u l e s i n new ways. Recognizing the i n e f f i c i e n c i e s of content r e f e r e n c i n g d u r i n g problem s o l v i n g , we w i l l pre-compute the i n d e x i n g b y a s s o c i a t i n g r u l e sets w i t h s t r a t e g i e s . Each r u l e set could have attached to it an attached procedure which would c o n s i s t of a content r e f e r e n c i n g r u l e . These attached procedures could be a c t i v a t e d whenever a new r u l e is added to the knowledge base to check f o r each r u l e set whether the new r u l e should be added to the s e t . Other procedures attached to the r u l e s e t w i l l check whether updating i s necessary when a r u l e is d e l e t e d or m o d i f i e d .

Matching

Procedural Attachment

Procedural attachment is a versatile mechanism for managing an object-centered knowledge base. The idea [Bobrow77] is to associate the procedure which performs a p a r t i c u l a r o p e r a t i o n w i t h the s p e c i f i c a t i o n o f the data o b j e c t s i n v o l v e d . In KRL[Bobrow77], attached procedures are a s s o c i a t e d w i t h the schemata f o r o b i e c t s and can be i n h e r i t e d by i n s t a n c e s * The KRL work emphasized the use of procedural attachment f o r g e n e r a l use i n problem s o l v i n g . In TEIRESIAS, " s l o t - e x p e r t s " played an important r o l e in knowledge a c q u i s i t i o n . In MOLGEN. attached procedures can be a c t i v a t e d when a c q u i r i n g a new p r o t o t y p e o r i n s t a n c e ; d u r i n g p a t t e r n matching; when t r y i n g t o f i l l a s l o t d u r i n g problem s o l v i n g ; a f t e r a s l o t has been f i l l e d : when accessing a value f o r a s l o t ; and when p r i n t i n g v a l u e s . An attached procedure can e i t h e r be a complicated program as in example 1 below or a r u l e in the knowledge base as in example 5, We g i v e a few examples of the use of a t t a c h e d procedures:

Acquisition of prototvpe or instance

(4) SUBGRAPH MATCHER: Another example of procedural attachment was discussed in the example of a Detection-Technique schema. In t h a t example, " o b s e r v a t i o n " procedures were attached to f e a t u r e s which could check a given s t r u c t u r e f o r the DNAfeature. Because many of the f e a t u r e s were instances of a DNA-graph, they i n h e r i t e d t h e i r o b s e r v a t i o n procedure from t h e i r g e n e r a l i z a t i o n ( i n those cases, the o b s e r v a t i o n procedure is a DNA subgraph m a t c h e r ) .

when Filled (5) SIDE EFFECTS: The attached procedure may a l s o be a " r u l e " in the knowledge base which describes the r e s u l t of changing the value of a parameter. Many p r o p e r t i e s or samples can be changed in numerous ways, and changing the p r o p e r t i e s can have many e f f e c t s on the r e s t of the sample. For example, sample pH can be a l t e r e d by the a d d i t i o n of a wide range of chemicals. Once p H i s changed, i t can a l t e r the a c t i v i t y o f every enzyme in the sample and denature many s t r u c t u r e s (cause t h e i r hydrogen bonds to b r e a k ) . These changes of course can g r e a t l y a f f e c t the course of an experiment. It would be p o s s i b l e to

Applications-2: 886

Martin

a s s o c i a t e the procedures t o c a r r y out the e f f e c t s of changed pH w i t h each of the chemicals t h a t changes i t . Alternatively, procedures might be associated w i t h each s t r u c t u r e or enzyme which could p o s s i b l y be changed by pH, and these could check f o r a l t e r e d pH whenever p r o p e r t i e s of the s t r u c t u r e or enzyme were checked, changed, or used. Neither of these a l t e r n a t i v e s appears to be as n a t u r a l or e f f i c i e n t as a t t a c h i n g a procedure to the schema f o r pH. which c a r r i e s out the a p p r o p r i a t e a l t e r a t i o n s in world s t a t e whenever pH i t s e l f i s changed. (6) PLAN CRITICS: New world s t a t e c o n d i t i o n s which may a f f e c t the plan could be detected by attached procedures which o f f e r advice on how to a l t e r the plan t o take the conditions into account. Such attached procedures could be c a l l e d "domain s p e c i f i c plan c r i t i c s " . For example, suppose t h a t sample temperature had been a l t e r e d and as a r e s u l t , c e r t a i n s t r u c t u r e s had become denatured (the e f f e c t of d e n a t u r a t i o n being c a r r i e d out by attached procedures as i n d i c a t e d above). A plan c r i t i c might be a c t i v a t e d a f t e r the e f f e c t s of temperature had been propagated through the world s t a t e . It might examine the new s i t u a t i o n and conclude t h a t the c o n t i n u a t i o n of the plan r e q u i r e d t h a t the a f f e c t e d s t r u c t u r e s be renatured (hydrogen bonds r e s e a l e d ) . It might be p o s s i b l e t o p r o v i d e the c r i t i c w i t h s p e c i f i c r e p a i r s t o s p e c i f i c problems. For example, i t might know t h a t if s t r u c t u r e s of a given type are denatured by r a i s i n g the sample t e m p e r a t u r e , and if the pH is in a given range, then the s t r u c t u r e s can be renatured by adding a c a l c u l a t e d q u a n t i t y of a c e r t a i n i o n , s o t h a t the plan can be continued. 4

Implementation

f o r accessing Many of the u t i l i t y programs the knowledge base have been and m o d i f y i n g A special editor, EDNA, for implemented, s t r u c t u r e s is a c q u i r i n g and m a n i p u l a t i n g DNA been t e s t e d and used by completed and has An i n i t i a l schema e d i t o r has been geneticists. c r e a t e d . The i n i t i a l sets of g e n e t i c o b j e c t s and t r a n s f o r m a t i o n s have been determined. The f i r s t system w i l l c a r r y out the task o f experiment c h e c k i n g . This means t h a t a set of input samples and a specific sequence of t r a n s f o r m a t i o n s w i l l be g i v e n . The system w i l l then s i m u l a t e the sequence of t r a n s f o r m a t i o n s on the r e p r e s e n t a t i o n s of the samples t e r m i n a t i n g w i t h a f i n a l set of samples. The f i n a l samples can be compared w i t h a c t u a l l a b o r a t o r y r e s u l t s as a t e s t of the i n i t i a l hypotheses or of the accuracy of the t r a n s f o r m a t i o n s in the knowledge base. This f i r s t system w i l l be used f o r debugging the t r a n s f o r m a t i o n knowledge base and by geneticists f o r comparing the predicted r e s u l t s from the MOLGEN system a g a i n s t a c t u a l l a b o r a t o r y experiments. 5

5.1

Acknowledgments

Other members of the Stanford community have c o n t r i b u t e d time and ideas to the MOLGEN p r o j e c t . We p a r t i c u l a r l y want to thank Bruce Buchanan, Randy Davis, Ed Feigenbaum, Jerry F e i t e l s o n , Joshua Lederberg, N i l s Nilsson and Terry Winograd.

6 BIBLIOGRAPHY Bobrow D.G. , Winograd T. An Overview of KRL. a Knowledge Representation Language. Cognitive Science No.1 V o l . 1 (Jan 1977) Brachman R . J . , What's in Foundations f o r Semantic No. 3433 (October 1976)

a Concept: S t r u c t u r a l Networks, BBN Report

Dahl O . J . , D i j k s t r a E,, Hoare C.A.R, Structured Programming. New York: Academic Press (1972) Davis R. A p p l i c a t i o n s of Meta Level Knowledge to the C o n s t r u c t i o n , Maintenance, and Use of Large Knowledge Bases. AI Memo 283. Computer Science Department, S t a n f o r d U n i v e r s i t y . ( J u l y 1976) F e i t e l s o n J , . S t e f l k M . J . , A Case Study of the Reasoning in a Genetics Experiment. H e u r i s t i c Programming P r o j e c t Memo HPP-77-18 (working paper) (May 1977) Goldberg A . , Kay A . , Learning Research Group. SMALLTALK-72 Instruction Manual. Xerox C o r p o r a t i o n (1976) Minsky M. A Framework f o r Representing Knowledge in . in P. P. Winston Winston (Ed.) The Psychology '^of Computer V i s i o n New York: McGra S t e f i k . M . J . , M a r t i n N. A Review of Knowledge Basea Problem S o l v i n g As a Basis f o r a Genetics Experiment Designing System, CS Report 77-596 Computer Science Department, Stanford U n i v e r s i t y . (Feb 1977) Woods W.A., What's in a L i n k : Foundations f o r Semantic Networks, in Representation and Understanding, S i d l e s i n C o g n i t i v e S c i e n c e , Bobrow and C o l l i n s ( e d s . ) (1975)

Summary

The use of a r e p r e s e n t a t i o n language i n v o l v i n g schemata and associated d e r i v e d models has been extended to i n c l u d e a l l aspects of domain knowledge and s t r a t e g y and h e u r i s t i c problem s o l v i n g knowledge. This u n i f o r m r e p r e s e n t a t i o n w i l l allow the extension of knowledge base management techniques for acquisition and r e t r i e v a l o f p r o c e d u r a l knowledge. As w i t h any problem s o l v i n g system, the success w i l l depend on the knowledge it has available. The purpose of our design is to c r e a t e a system which can f a c i l i t a t e the process of t r a n s f e r r i n g knowledge from the user to the system, and use t h i s knowledge e f f e c t i v e l y in problem s o l v i n g .

Applications-2: 887

Martin

Knowledge Base Management for Experiment Planning in ... - CiteSeerX

Knowledge Base Management for Experiment Planning in ... - CiteSeerX

Suggest Documents

Building knowledge base management systems - CiteSeerX

Knowledge Base Systems - CiteSeerX

knowledge management and succession planning - CiteSeerX

Knowledge Base Management Systems and The Knowledge ...

Presentation planning using an integrated knowledge base

Presentation planning using an integrated knowledge base

KNOWLEDGE BASE REPRESENTATION FOR AN ... - CiteSeerX

Desktop Experiment Management - CiteSeerX

Knowledge Management Support for Enterprise Resource Planning ...

a knowledge base for knowledge-based multiagent ... - CiteSeerX

Knowledge Management in Process Planning - Semantic Scholar

Knowledge Management in Enterprise Resource Planning Systems

Knowledge Management in Enterprise Resource Planning Systems ...

software requirements for knowledge management in ... - CiteSeerX

Experiment Management Support for Performance Tuning ... - CiteSeerX

effective management - Crop Genebank Knowledge Base - CGIAR

Knowledge Base

Knowledge Management and Knowledge Management ... - CiteSeerX

Knowledge Management in SMEs - CiteSeerX

Standardisation in Knowledge Management - CiteSeerX

Representing Spacecraft Mission Planning Knowledge in ... - CiteSeerX

Knowledge Base for RTD Competencies in IST

Knowledge Decay in a Normalised Knowledge Base

Performing Business Processes Knowledge Base 1 ... - CiteSeerX