IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 20, NO. 7, JULY 2008

Molecular Verification of Rule-Based Systems Based on DNA Computation

Chung-Wei Yeh and Chih-Ping Chu

Abstract—Various graphic techniques have been developed to analyze structural errors in rule-based systems that utilize inference (propositional) logic rules. The four typical errors in rule-based systems are 1) redundancy (numerous rule sets resulting in the same conclusion), 2) circularity (a rule leading back to itself), 3) incompleteness (dead ends or a rule set conclusion leading to unreachable goals), and 4) inconsistency (rules conflicting with each other). This study presents a new DNA-based computing algorithm, mainly based on Adleman's DNA operations, that can be used to detect such errors. There are three phases to this molecular solution: rule-to-DNA transformation design, solution space generation, and rule verification. We first encode individual rules by using relatively short DNA strands and then generate all possible rule paths by the directed joining of such short strands to form longer strands. We then conduct the verification algorithm to detect errors. The potential of applying this proposed DNA computation algorithm to rule verification is promising, given the operational time complexity of O(n × q), in which n denotes the number of fact clauses in the rule base, and q is the number of rules with the longest inference chain.

Index Terms—DNA computing, rule-based system, rule verification.

1 INTRODUCTION

Comprised of facts and rules acquired from experts, rule-based systems have become an integral feature of expert systems and have been used extensively in operational research. According to several studies [1], [2], [3], [4], [5], [6], structural errors (incompleteness, circularity, conflict, and redundancy) exist in rule-based systems. These errors are mostly a result of rule refinement generated by experts. Thus, many techniques such as Petri nets [1], [6], [7] and graphs [4], [5] have been created to identify these errors. DNA (deoxyribonucleic acid) computing, which utilizes parallel computing, can be used to solve large problems. This work introduces an alternative molecular computing solution for verifying rule-based systems in the general form.

Adleman [8], who solved a Hamiltonian path problem (HPP) for a directed graph with seven nodes, demonstrated the efficacy of using molecules in a solution to solve computational problems. Lipton solved a satisfiability (SAT) problem to demonstrate the advantage of using the massive parallelism inherent in DNA-based computing. Current studies [9] suggest the possibility of representing 10^18 bits of information as 10^18 DNA strands, which can be contained in a test tube. Moreover, 10^18 bits of data can be processed in parallel by utilizing basic biological operations. We use DNA computation in this study to develop an algorithm that can efficiently verify rule-based systems.

The authors are with the Department of Computer Science and Information, National Cheng Kung University, No. 1, Ta-Hseuh Road, Tainan, Taiwan, ROC. E-mail: [email protected], [email protected].
Manuscript received 24 Jan. 2006; revised 3 Aug. 2006; accepted 6 Dec. 2007; published online 19 Dec. 2007.
For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TKDE-0033-0106.
Digital Object Identifier no. 10.1109/TKDE.2007.190743.
1041-4347/08/$25.00 © 2008 IEEE

The proposed approach involves incubating the solution space containing DNA strands and presents a DNA-based algorithm for generating a general solution based on the operations defined in Adleman’s experiments. The remainder of this paper is organized as follows: Section 2 briefly describes the structural errors of rule-based systems. Section 3 introduces DNA computation and presents the DNA algorithm used for rule verification. Section 4 utilizes the proposed algorithm in operations and analyzes its complexity. Section 5 concludes the report.

2 STRUCTURAL ERRORS IN RULE VERIFICATION

The IF-THEN rule in a rule-based system is typically formed as X → Y (Fig. 1a), in which X is an antecedent fact clause (node), and Y is a conclusion node. Two kinds of nodes, atomic and compound, exist in rule bases. One study [5] found that compound antecedents in rules are only present in the conjunction format, such as "If (X1 and X2 ... and Xn) then Y," and that only an atomic node is allowed to be a conclusion (Fig. 1b). An antecedent in a rule base can lead to multiple conclusions, expressed as separate rules that share the same antecedent, one for each conclusion (Fig. 1c). The proposed verification algorithm is derived by detecting structural errors among rule sets. As described in the work of Ramaswamy et al. [5] and Yang et al. [6], the following are typical errors:

1. Incompleteness. When a rule graph conclusion is not the goal, the rule sets are incomplete (Fig. 2a). This type of error occurs due to any of the following:
   • A rule conclusion is not a goal.
   • Rules are irrelevant.
   • The rule condition is missing.

Published by the IEEE Computer Society


Fig. 2. (a) Incomplete rules. (b) Circular rules. (c) Conflicting rules. (d) Redundant rules.

Fig. 1. (a) Basic rule. (b) Compound antecedent rule. (c) Multiple conclusion rules.

2. Circularity. When an antecedent leads back to itself or to its compound antecedent subset in a rule graph, the rule sets are circular (Fig. 2b), indicating that inference rules are circularly dependent and generate a deadlock.

3. Conflict. Rule conflict arises when the same antecedent results in mutually exclusive conclusions. Given that the conflict specifies that rules lead to contradictory conclusions (Fig. 2c), such errors are difficult to identify in massive rule bases.

4. Redundancy. When rule sets result in the same conclusion in a rule graph, these rule sets are redundant (Fig. 2d). Redundancy commonly occurs when different inference rule paths lead to an intermediate or a final conclusion in a rule graph.

3 DNA SOLUTION

DNA computing, with its massive parallelism, may become a strong competitor to electronic computing, since there are barriers to the continued development of traditional silicon-based computers [10]. The mechanism behind the proposed DNA approach for detecting rule errors is based on the biological reactions of DNA. Braich [11] proposed seven constraints on strand library design, which require that the DNA sequences be designed so that the strands have little secondary structure that might inhibit the intended probe-library hybridization. The design must also exclude sequences that might encourage unintended probe-library hybridization. Thus, a relatively short single-stranded DNA molecule (an oligonucleotide or simply "oligo") of unique sequence is designed to represent each node and each "splint" (as described in the following) based on the above-mentioned constraints.

When the appropriate DNA strands are in a tube, short single-stranded node oligos can be brought together and then covalently joined with a defined polarity (using the appropriate "splint" and an enzymatic reaction) into longer single-stranded DNA molecules via the temporary establishment of complementary (double-stranded) base-pairing over the length of the splint. The result is the encoding of each rule path as a long single-stranded DNA molecule comprised of the appropriate nodes that have been covalently linked into a linear chain whose directionality is defined by the polarity (5′-3′) of the DNA itself. All these chains constitute the possible solution space T. We then design the VeriRuleDNA algorithm, using only practical operations, to filter out (eliminate) the error rule paths from space T. In this way, the solution rule sets can be generated, and the rule sets with any of the four types of errors can be removed.

Two fundamental assumptions in DNA computation are that the data can be encoded in DNA strands and that molecular biological operations can be used to conduct all computational operations. In Section 3.1, we briefly introduce the basic structure of the DNA molecule and then describe the biological operations that are applied in our approach. A common concern in the evaluation of the time complexity of DNA algorithms is that every DNA operation requires a certain amount of time to implement. The construction of the initial sets of strands and reading out the final solution are certainly time consuming as well [10]. In Section 4.2, we discuss how the strong model of Amos [10] can be used as a basis for evaluating the time complexity of our algorithm.

3.1 Background of DNA Computation

The DNA molecule acts as the basis of DNA-based computation [12]. Although the physicist Richard Feynman first proposed the construction of submicroscopic computers in 1961 [13], it was not until 1994 that Adleman succeeded in manipulating DNA strands to solve an HPP in a test tube. Since then, many DNA algorithms have been developed to solve NP-complete problems. Section 3.1.1 describes the structure of DNA and the basic anneal operation on DNA. The operations of molecular computation are listed in Section 3.1.2.

3.1.1 The Structure and Anneal Operation of DNA

DNA is a linear polymer of nucleotide monomer units. Each nucleotide comprises a heterocyclic base attached to a deoxyribose sugar. Consecutive nucleotides in a DNA strand are linked by a phosphodiester bond between the 3′ hydroxyl (OH) group on one sugar and the 5′ hydroxyl group on the next sugar. The polarity of a single DNA strand is determined by the fact that one end has a terminal nucleotide with a free 5′-OH (and is hence the 5′-end), whereas the opposite end of the strand has a terminal nucleotide with a free 3′-OH (and is hence the 3′-end). The information content of the DNA is encoded by the 5′-3′ sequence of consecutive bases in the polymer, which may be adenine (A), cytosine (C), guanine (G), or thymine (T). By convention, the 5′-3′ sequence of bases in the DNA is written from left to right, unless indicated otherwise.

Information replication and transfer via DNA can be achieved because DNA naturally forms a double-stranded molecule. This occurs as a result of specific reversible hydrogen bond formation between sterically complementary "base pairs," either {A and T} or {C and G}, that directly face each other but are on the two oppositely oriented strands of DNA that make up the double-stranded molecule (Fig. 3b). Thus, the two strands in the double-stranded DNA are referred to as the reverse complement (or simply the complement) of each other.

Fig. 3. (a) Representation of the single-stranded DNA. (b) Structure of the double-stranded DNA. (c) DNA denaturing and annealing [10].

DNA computations commonly apply a specific sequence of biological operations in tubes to solve a problem. Each tube's contents are a set of molecules of DNA whose strands can be represented by a multiset of finite strings over {A, C, G, T}. When a solution of single strands is cooled, annealing allows complementary strands to bind together (Fig. 3c). To anneal the 5′-end of a single-stranded DNA molecule (strand s2) to the 3′-end of another single-stranded DNA molecule (strand s1) in tube T, we hybridize a set of specific splint oligos, each 20 monomers (nucleotides) in length. Typically, the splint consists of the complement of the 10 nucleotides at the 3′-end of s1 and the complement of the 10 nucleotides at the 5′-end of s2. After the appropriate annealing reactions, each splint creates a region of double-stranded sequence that allows a "DNA ligase" enzyme to effect the molecular connection reaction (Fig. 4).

Fig. 4. (a) Two single-stranded DNAs s1, s2 and the splint. (b) Annealing s1 and s2.

3.1.2 Operations on DNA Computation

Solution-based and surface-based DNA computations are the two major models in DNA computing research. Surface-based technology uses DNA strands immobilized on a surface (e.g., a defined region on a microscope slide or the surface of a small well in a multiwell plate), while solution-based DNA computations are carried out in test tubes. The surface-based technology eliminates the loss of strands, which could reduce errors in the computation [14]. Solution-based DNA computation was inspired by the pioneering work of Adleman [8] and has been widely adopted as a potentially useful approach for solving NP-class problems [8], [9], [10], [15], [16], [17]. Many of the solution-based DNA models use common abstract operations (e.g., Union). Here, we use the basic DNA strand manipulation operations that are allowed and defined in Adleman's experiments [8], the Parallel Filtering Model of Amos [10], and an alternative filtering-style model (the Sticker Model) developed by Roweis et al. [16]. The operations defined by Amos have the advantage of being based on very precise and complete chemical reactions that can be implemented reliably in real experiments. The implementation of the Extract operation is described as follows, whereas the implementations of the other operations are described in detail elsewhere [9], [10], [16]:

• Extract(T, s, T+, T−). With the test tube represented as T and the DNA strand as s, the operation is performed to extract all strands that contain s from the original tube T to yield a new tube T+. The remaining strands in tube T are poured into another tube T−. According to Adleman [8], the Extract operation is physically implemented as follows: Oligos S, whose sequence is the reverse complement of s, are prepared with a biotin residue extending from each oligo. A bead matrix whose surface is coated with covalently attached avidin protein molecules (which bind biotin very strongly) is then used as a solid support, to which the biotinylated S oligos are attached. Tube T is first heated to ensure that all of the DNA strands in it are single stranded. The strands that contain s are then removed by passing the contents of T over a separation column containing the bead matrix and immobilized S oligos described above. The column is then washed to remove into a new tube T− all of the strands in T that do not contain s. This leaves behind, on the column, only those strands that contain s, which can then be recovered (by denaturation) from the separation column into a new tube T+ [8].

• Separate(T, s, k, Ton, Toff). This operation separates strands into Ton and Toff. With s representing a nucleotide sequence of length l, if strands in tube T contain s starting from position k, they are poured into Ton; otherwise, they are poured into Toff [16].

• Union(T0, T1, T2, ..., Tn). The Union operation is performed by mixing the DNA strands from n tubes T1, ..., Tn and putting them into one tube T0. This process empties tubes T1, T2, ..., Tn [10].

• Copy(T, T1, T2, ..., Tn). In parallel, this operation produces copies T1, T2, ..., Tn of tube T [10].

• Read(T). This operation describes every DNA strand in tube T [9].

• Remove(T). Given a test tube T, the Remove operation discards all strands in tube T and empties tube T [9].

• Detect(T). Given a test tube T, the Detect operation is performed to check whether tube T contains at least one DNA strand. It returns Y if it contains DNA strands; otherwise, it returns N [9].
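The abstract behavior of these tube operations can be mimicked in software. The following is a minimal sketch, not a model of the underlying chemistry: a tube is assumed to be a Python `Counter` (a multiset of strings over A, C, G, T), position k in Separate is assumed to be 0-based, and the function names and example strands are illustrative only.

```python
from collections import Counter


def extract(tube, s):
    """Extract(T, s, T+, T-): split T by whether each strand contains s."""
    t_plus = Counter({x: c for x, c in tube.items() if s in x})
    t_minus = Counter({x: c for x, c in tube.items() if s not in x})
    tube.clear()  # every strand has been poured out of T
    return t_plus, t_minus


def separate(tube, s, k):
    """Separate(T, s, k, Ton, Toff): split T by whether s starts at position k."""
    t_on = Counter({x: c for x, c in tube.items() if x.startswith(s, k)})
    t_off = Counter({x: c for x, c in tube.items() if not x.startswith(s, k)})
    tube.clear()
    return t_on, t_off


def union(*tubes):
    """Union(T0, T1, ..., Tn): pour T1..Tn into a new tube T0 and empty them."""
    t0 = Counter()
    for t in tubes:
        t0.update(t)
        t.clear()
    return t0


def copy(tube, n):
    """Copy(T, T1, ..., Tn): produce n copies of tube T."""
    return [Counter(tube) for _ in range(n)]


def read(tube):
    """Read(T): describe every strand in T (here, as a sorted list)."""
    return sorted(tube.elements())


def remove(tube):
    """Remove(T): discard all strands in T."""
    tube.clear()


def detect(tube):
    """Detect(T): True (Y) if T holds at least one strand, else False (N)."""
    return sum(tube.values()) > 0


# Hypothetical strands, chosen only to exercise the operations.
tube = Counter(["ACGTAC", "TTGACG", "ACGTAC", "GGGG"])
t_plus, t_minus = extract(tube, "ACG")
t_on, t_off = separate(t_plus, "ACG", 0)
on_strands, off_strands = read(t_on), read(t_off)
merged = union(t_on, t_off, t_minus)
```

Using a multiset rather than a set keeps the sketch closer to the text, which describes a tube as a multiset of finite strings.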


Fig. 5. Overview of a DNA-computation-based approach for detecting structural errors in a rule base.

3.2 DNA Algorithm for Rule-Based System Verification

In this section, we propose the use of solution-based DNA computation as an approach to obtaining a rule base that is free of the four typical structural errors described earlier (Fig. 2). Fig. 5 presents an overview of the approach, which is divided into three phases: rule-to-DNA transformation design, solution space generation, and rule verification. In the first phase, components of each inference rule are represented by specific DNA strands. In the second phase, five presteps and an Init procedure are used to process the strands to establish an initial solution space based on the strand definitions adopted in phase 1. The third phase then processes the strands to remove those that represent rules with circularity, conflict, or redundancy. The VeriRuleDNA algorithm was developed to execute the procedures in the second and third phases. Detailed explanations are given in the following for the VeriRuleDNA algorithm (Section 3.2.1) and the DNA strand definitions (Section 3.2.2). Sections 3.2.3 and 3.2.4 describe how to generate a solution space containing rule sets that make the goal reachable and are without circularity, and Section 3.2.5 presents procedures for detecting conflict and redundancy.

3.2.1 DNA Algorithm for Verifying the Rule-Based Systems

In this approach, the computation starts with the solution space T. The construction of T and the procedures for performing each step of the algorithm are explained in Sections 3.2.2-3.2.5.

Algorithm VeriRuleDNA
  Step 1: Init(T, X1, Xn)
  Step 2: DetectCir(T, Tr3, Xi)
  Step 3: DetectConflict(T, Xi, Xj, pr)
  Step 4: DetectRedundancy(T, Tr3, Xi)
  Step 5: DetectConSys(T, TReq, Xi, Xj)
  Step 6: If Detect(T) = Y, then Read(T)


Fig. 6. (a) Three distinct strands representing three facts. (b) Two distinct strands representing the "∧" operator and two distinct strands representing the "→" operator. (c) The resulting complexes (c-1, c-2), both representing rule R1.

3.2.2 Representing Inference Rules by DNA Strands

For a general inference rule R: (X1 ∧ ... Xi ...) → Xj in a domain of n nodes, each antecedent node (denoted as AN) Xi and conclusion node (denoted as CN) Xj, here called operands, is represented by a DNA strand that is 20 nucleotides long (based on the rationale in [8]). A general inference rule contains two types of relationships: "∧" (conjunction) and "→". Two tetranucleotide sequences, "AAAA" and "CCCC" (and their complements), can be incorporated into the strands used for DNA computation to enforce the relationships "∧" and "→", respectively. The presence of these tetranucleotide sequences does not violate the specifications for DNA strand design that are described in [11]. Thus, different DNA splints that are 24 nucleotides long (and contain the appropriate tetranucleotide sequence indicated above in the middle of the splint) can be designed to enforce "xi ∧ xj" and "xi → xj" relationships.

The "xi → xj" operator is a splint whose DNA sequence (in the 3′-5′ direction) consists of the complement of the 10 nucleotides at the 3′-end of the strand node Xi, four nucleotides of "CCCC," and the complement of the 10 nucleotides at the 5′-end of strand node Xj. The "xi ∧ xj" operator is a splint whose DNA sequence (in the 3′-5′ direction) consists of the complement of the 10 nucleotides at the 3′-end of the strand Xi, four nucleotides of "AAAA," and the complement of the 10 nucleotides at the 5′-end of strand Xj.

There are k! strands representing the same rule with a k-node compound AN, because all permutations of the k nodes are allowed to represent the rule. In order to detect circularity and redundancy in the rule set, the CN for one of the k! strands, chosen randomly, is poured into the tube Tr1 for each rule with a compound AN (as will be discussed in Section 3.2.3). For instance, the rule R1 shown in Fig. 6 is a rule with a two-node compound AN.
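The splint construction described above can be illustrated with a short sketch. The two 20-nucleotide node sequences are made up for the example and are not taken from the paper or checked against the design constraints of [11]. The sketch also assumes that the linker joined between two nodes is the complement of the splint's tetranucleotide ("GGGG" opposite "CCCC", "TTTT" opposite "AAAA"), which is how the presteps in Section 3.2.3 appear to use those strands.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
TETRA = {"->": "CCCC", "and": "AAAA"}  # operator -> tetranucleotide in the splint


def complement(seq):
    """Base-wise Watson-Crick complement, keeping the left-to-right order."""
    return "".join(COMPLEMENT[b] for b in seq)


def splint(xi, xj, operator):
    """24-nt splint, written 3'->5', that joins node xi to node xj.

    xi and xj are 20-nt node strands written 5'->3'.
    """
    return complement(xi[-10:]) + TETRA[operator] + complement(xj[:10])


def pairs_with(splint_3to5, target_5to3):
    """True if the splint base-pairs with the target over its full length."""
    return len(splint_3to5) == len(target_5to3) and all(
        COMPLEMENT[a] == b for a, b in zip(splint_3to5, target_5to3)
    )


# Hypothetical node sequences (5'->3'), 20 nucleotides each.
x1 = "ATCGGATCCGTTAGCATGCA"
x2 = "GCTAAGTCCGATTGCAATGC"

implies_splint = splint(x1, x2, "->")
and_splint = splint(x1, x2, "and")

# Region of the joined strand x1-GGGG-x2 that the "->" splint should cover.
junction = x1[-10:] + "GGGG" + x2[:10]
```

Because the splint is written 3′→5′ and the node strands 5′→3′, the antiparallel pairing reduces to a position-by-position complement check.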
3.2.3 Solution Space for Generating the Complete Graph

Two tasks proceed to generate the solution space containing rule paths leading to the goal node: the creation of tube T and the application of the Init procedure. The first task, which forms the desired initial multiset of strands, is achieved in five presteps (Fig. 7):


Fig. 7. Forming the initial multiset strands.

Prestep 1: Initialization. We begin with tubes T∧, T→, Tr1, Tr2, Tr3, Tc, and Ts. The strands representing operators "x ∧ y" and "x → y" are poured into T∧ and T→, respectively. In order to prevent circularity from occurring at nodes X1 or Xn, copies of strands designed as 1) "GGGG" followed by the 10 nucleotides at the 5′-end of the node X1 and 2) the 10 nucleotides at the 3′-end of the node Xn followed by "GGGG" are poured into tube T→. If any splint in T→ anneals to the above strands, it is filtered out. Since tube Tr1 (Section 3.2.2) contains the CN of rules with compound AN, the CN for each rule strand with atomic AN is then poured into tube Tr2. In order to identify which CN has more than one rule leading to it, strands in Tr2 are then poured into Tr1 to make sure that Tr1 contains the CN for all rules. We then pour only one copy of the complements for all CN into Tr1. Any CN in Tr1 that does not anneal to its complement is poured into Tr3. Thus, each element of the set of nodes in Tr3 has more than one "→" operator leading to it. Subsequently, copies of all AN are poured into Tc to process compound AN.

Prestep 2: Processing of tubes Tc and Ts. Strands from T∧ that act as splints are poured into Tc to anneal to compound AN. Copies of strands "TTTT" are poured into Tc, and DNA ligation is allowed to occur. Therefore, each set of compound AN in Tc sticks together to become double strands. We then separate the single strands from Tc into Ts.

Prestep 3: Splint-mediated covalent joining (ligation) of strands representing compound AN. To handle the situation in which one of the compound AN of a rule is the CN of another rule with compound AN, we pour a population of strands from T→ into tube Tc. Strands from T→ acting as splints bind the above-mentioned strands together to form long double-stranded DNA sequences, which are subsets of rule chains.

Prestep 4: Ligation of strands to form rule-chain subsets containing atomic AN. Copies of "GGGG" and the strands from Ts are poured together into Tc to anneal strands, each long strand representing one set of possible inference rules that may contain typical structural errors.

Prestep 5: Denaturing into tube T. The double-stranded DNAs from Tc are then denatured into single strands and poured into tube T. All possible rule sequences are represented in T.

An example with four of the five presteps mentioned above is depicted in Fig. 7.
The following algorithmic description initially inputs into tube T all strands encoding possible operands Xi and their connected operators. The completeness of the rules is verified by checking for the presence of strands that begin with the starting node X1 and terminate with the ending node Xn.

Procedure Init(T, X1, Xn)
  1. Input(T, X1, Xn)
  2. Extract(T, X1, Ty, Tdrop)
  3. Extract(Ty, Xn, T, Tdrop)
  4. Remove(Tdrop)

To extract all strands encoding X1 and Xn from T, we implement the Extract operation (T, X1, Ty, Tdrop) by adding many copies of a primer corresponding to the complement of X1 into tube T. The primers only anneal to single strands containing X1. The extraction (line 2) forms a new tube Ty consisting of strands containing X1. There is no splint that complements the 10 nucleotides at the 5′-end of X1. Therefore, if a strand contains subsequence X1, then X1 has to be located at the front of the strand. We implement the second Extract operation (Ty, Xn, T, Tdrop) by adding many copies of a primer corresponding to the complement of Xn to tube Ty. Similar to Extract(T, X1, Ty, Tdrop), if a strand contains subsequence Xn, then Xn has to be located at the end of the strand. As we shall see, strands in T construct a complete path, starting with node X1 and ending with node Xn.
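The effect of the Init procedure can be simulated in a few lines. This is only a sketch under simplifying assumptions: each strand is modelled as a tuple of node labels rather than a nucleotide sequence, operators are omitted, and the candidate strands are hypothetical. The helper names `extract` and `init` are illustrative.

```python
def extract(tube, node):
    """Extract(T, node, T+, T-) on strands modelled as tuples of node labels."""
    with_node = [s for s in tube if node in s]
    without_node = [s for s in tube if node not in s]
    return with_node, without_node


def init(tube, start, goal):
    """Keep only the strands that contain both the start and the goal node."""
    t_y, t_drop = extract(tube, start)    # line 2: Extract(T, X1, Ty, Tdrop)
    tube, dropped = extract(t_y, goal)    # line 3: Extract(Ty, Xn, T, Tdrop)
    t_drop += dropped
    t_drop.clear()                        # line 4: Remove(Tdrop)
    return tube


# Hypothetical candidate strands produced by the presteps.
candidates = [
    ("X1", "X4", "X11", "X14"),   # complete path
    ("X1", "X8"),                 # dead end: never reaches the goal
    ("X5", "X9", "X10", "X14"),   # does not begin at the start node
    ("X1", "X6", "X10", "X14"),   # complete path
]
complete = init(candidates, "X1", "X14")
```

In the molecular setting the position of X1 and Xn is guaranteed by the splint design; in this sketch the example strands were simply written so that the same property holds.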

3.2.4 The Circularity Detection of the Complete Graph

Designed to detect circularity, the DetectCir procedure removes strands in which the same node appears in at least two locations. Only strands containing the nodes existing in Tr3 can have a circularity problem. Using z to represent the number of nodes in tube Tr3, the procedure can be thus represented as follows:

Initially, z copies of T are generated as tubes Ti. Four extra tubes Ti^+, Ti^f, Ti^cir, and Ti^tmp are also necessary. Tube Ti^f is designed to hold the set of strands that do not need to be compared. If a strand from Ti is encoded as Xn at position (q × 24), then it is poured into Ti^f. Strands from Ti and Ti^+ are extracted to detect the occurrence of Xi at position (q × 24) within a loop. Within the qth iteration of the while loop between lines 4 and 10, the strands containing the first occurrence of node Xi are poured from tube Ti into Ti^+, and the strands in each tube Ti^+ that contain more than one occurrence of Xi are poured from tube Ti^+ into tube Ti^cir in parallel. Thus, each tube Ti^cir contains strands representing circular rule paths when considering node i. At the end of execution, tubes Ti^cir are merged into Tcir. We then extract the Tcir strands from T. Thus, the strands that remain in T should be circularity free.
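The outcome of DetectCir, as described in the paragraph above, can be approximated in software. The sketch below does not reproduce the position-by-position Separate loop of the procedure; it only captures the stated result, namely that a strand is circular when a node from Tr3 occurs in it at least twice. Strands are modelled as tuples of node labels, and the contents of `tr3` are an assumption read off the example rule base in Section 4.1 (nodes that are the conclusion of more than one rule).

```python
def detect_cir(tube, tr3):
    """Split the tube into circularity-free strands and circular strands."""
    t_cir = []
    for node in tr3:  # one copy Ti of the tube per node in Tr3
        for strand in tube:
            if strand.count(node) >= 2 and strand not in t_cir:
                t_cir.append(strand)  # pour into Ti^cir, later merged into Tcir
    remaining = [s for s in tube if s not in t_cir]
    return remaining, t_cir


tube = [
    ("X1", "X2", "X4", "X7", "X4", "X11", "X10", "X14"),
    ("X2", "X1", "X4", "X7", "X4", "X11", "X10", "X14"),
    ("X1", "X2", "X4", "X11", "X10", "X14"),
]
tr3 = ["X4", "X8", "X10"]  # assumed contents of tube Tr3

remaining, t_cir = detect_cir(tube, tr3)

# 0-based node index of the second occurrence of X4 in the first strand.
first = tube[0]
second_x4 = first.index("X4", first.index("X4") + 1)
```

With 24 nucleotides per node-plus-operator unit, a node index of 4 corresponds to the position q × 24 with q = 4, which agrees with the iteration reported for this strand in Section 4.1.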

3.2.5 The Conflict and Redundancy Detection of the Complete Graph

Since the conflict nodes in the rule base are defined as known conditions in the system, we define the conflict nodes as a pair (Xi, Xj). Each of the nodes Xi and Xj is represented by a distinct DNA strand that is 20 nucleotides long. The complements of each conflicting node pair Xi and Xj are created and stored as an input to procedure DetectConflict. Starting from the first pair (pr = 1), to identify whether tube T contains Xi or Xj, one strand of the complement pair is added into T, and the other is added subsequently while executing DetectConflict. The procedure is given as follows:

Procedure DetectConflict(T, Xi, Xj, pr)
  1. Input(T, Xi, Xj)
  2. Extract(T, Xi, T+, T−)
  3. Extract(T+, Xj, Tdrop, T1)
  4. Extract(T−, Xj, T2, T3)
  5. Copy(T1, Tpr-1), Copy(T2, Tpr-2)
  6. Union(T, T1, T2, T3)
  7. pr = pr + 1

The conflicting node sets are defined, and copies of the complement sets are created. For each conflict node Xi, we split T into two sets T+ and T−, where T+ contains only strands containing Xi, and T− contains only strands that do not. Strands in T+ containing the conflict node Xj are poured into Tdrop to be discarded; otherwise, they are poured into T1 (line 3). Strands in T− are extracted to find out whether T− contains Xj (line 4). Strands in T1 and T2 are considered mutually exclusive, because they conflict with each other as a result of the node pair (Xi, Xj). Extra copies Tpr-1 (copied from tube T1) and Tpr-2 (copied from tube T2) are then created for each conflicting pair (Xi, Xj). We then create a new set T by combining T1, T2, and T3 (line 6). These steps are executed repeatedly if other conflicting node pairs exist. Strands present in tube Tdrop contain conflicting nodes and hence are invalid. In particular, strands containing nodes that conflict with system-required rules need to be removed (as will be discussed in the following).

Fig. 8. Flowchart for the detection of redundancy.

As described in Fig. 2, redundant rules occur when alternative rules with the same CN exist in a rule base. However, some redundant rule paths could contain nodes that participate in compulsory rule paths, a possibility that creates additional complications. To resolve this problem, we adopt the parallel filtering model of Amos [10] and apply the concept of "strand rule relationship." Formally, given a particular "redundant node," strand sets are classified based on irrelevancy, redundancy, and requirement. Irrelevancy refers to the set of strands that are irrelevant to the strands that are required for a specific redundant node. Redundancy refers to the set of strands that act as alternatives to the strands that are required for a specific redundant node. Requirement refers to the set of strands that are required for some rule sets that contain a specific redundant node.
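The tube bookkeeping of the DetectConflict procedure given earlier in this section can be traced with a small simulation. This is a sketch only: strands are modelled as tuples of node labels, the four example strands are hypothetical, and the function returns the new tube T, the two saved copies (Tpr-1 and Tpr-2), and the discarded tube Tdrop.

```python
def detect_conflict(tube, xi, xj):
    """One pass of DetectConflict for the conflicting pair (xi, xj)."""
    t_plus = [s for s in tube if xi in s]        # line 2: strands with Xi
    t_minus = [s for s in tube if xi not in s]   # line 2: strands without Xi
    t_drop = [s for s in t_plus if xj in s]      # line 3: Xi and Xj together
    t1 = [s for s in t_plus if xj not in s]      # line 3: Xi only
    t2 = [s for s in t_minus if xj in s]         # line 4: Xj only
    t3 = [s for s in t_minus if xj not in s]     # line 4: neither node
    copies = (list(t1), list(t2))                # line 5: Tpr-1 and Tpr-2
    return t1 + t2 + t3, copies, t_drop          # line 6: Union(T, T1, T2, T3)


# Hypothetical strands; (X4, X9) is the conflicting pair.
only_x4 = ("X1", "X2", "X4", "X11", "X10", "X14")
only_x9 = ("X1", "X8", "X9", "X10", "X14")
neither = ("X1", "X6", "X8", "X10", "X14")
both = ("X1", "X4", "X9", "X14")

new_tube, (t_pr_1, t_pr_2), dropped = detect_conflict(
    [only_x4, only_x9, neither, both], "X4", "X9"
)
```

Only the strand that contains both nodes of the pair is discarded; strands that contain exactly one of them survive and are recorded for the later check against system-required rules.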

Initially, copies of T are generated as tubes Ti. Four extra tubes Ti^+, Ti^−, Ti^Run, and Ti^Req are also necessary for each redundant node. The strands containing the Xi node are extracted from each tube Ti into Ti^+; otherwise, they are extracted into Ti^− by the parallel Extract operation (line 2). The strands containing "→ Xi" are then extracted from each tube Ti^+ into Ti^Run; otherwise, they are extracted into Ti^Req by the parallel Extract operation (line 3). To represent this rule relationship for each redundant node, the strands remain in tubes Ti^−, Ti^Run, and Ti^Req. Note that if a rule chain contains the redundant node Xi but lacks a node that directly infers Xi, then Xi must be part of the compound AN in the rule chain. Therefore, this rule chain needs to exist when considering redundant node Xi. We conclude that if a strand chain, for example, called StrR, contains a specific redundant node Xi, then the strands in Ti^− are irrelevant to StrR, the strands in Ti^Run are a redundant set to StrR, and the strands in Ti^Req are a required set to StrR. The flowchart is depicted in Fig. 8.

TABLE 1 Result Generated by Executing the Procedure DetectCir(T, Tr3, Xi)

Fig. 9. Schematic for rule base R (illustrating a rule base containing the four errors).

A rule path must be removed if it contains nodes that conflict with nodes of system-required rules required for completeness. To effect this removal, we first select the redundant node (numbered rc) that is nearest the goal node. Strands in tube Trc^Req are system-required rules and will be poured from Trc^Req to TReq. Therefore, the procedure DetectConSys is constructed to detect conflicts and is described as follows:

Assuming that there are c pairs of conflicting nodes, c copies of Tir are created. The complements of the strands encoding the conflicting nodes Xi are poured into Tir. If Xi is detected in Tir, it indicates the existence of essential rules that require a node that is a member of a conflicting pair of nodes. Hence, we must remove strands that contain the other conflicting node Xj from T and from all required and redundant tubes in parallel.

4 CORRECTNESS AND COMPLEXITY ANALYSIS

The following theorem allows DNA computation to verify the rule-based system. Additionally, a corollary is presented to determine algorithm complexity.

4.1 An Example of the Use of VeriRuleDNA to Detect Errors

Theorem 1. The verification of four typical structural errors for a rule base with n nodes can be resolved using the VeriRuleDNA algorithm.

Proof. The algorithm is used to verify the rule-based system and detect the four structural errors. A rule base R is defined as follows, where Rk identifies every specific rule:

R = {
  R1: X2 → X3
  R2: (X1 ∩ X2) → X5
  R3: X1 → X8
  R4: (X1 ∩ X2) → X4
  R5: X4 → X11
  R6: X1 → X6
  R7: X4 → X7
  R8: X7 → X4
  R9: X5 → X8
  R10: X5 → X9
  R11: (X8 ∩ X9) → X10
  R12: (X6 ∩ X8) → X10
  R13: X10 → X13
  R14: X12 → X10
  R15: (X10 ∩ X11) → X14
}. □

The description of strands in Section 3.2.2 indicates that before VeriRuleDNA is executed, the strands of n nodes (operands) and operators are created to represent the given rules, where n is the number of nodes (n = 14). Nodes X1 and X2 are defined as starting nodes, and node X14 is defined as the ending node (Fig. 9). By generating a solution space through the two tasks described in Section 3.2.3, Step 1 produces the set of all possible complete strands, starting with X1 and X2 and ending with X14. Step 2 excludes the strands representing rule paths that contain circularity (circularity strands) by executing the Separate operation in parallel within a loop (Table 1). The strands {X1 ∧ X2 → X4 → X7 → X4 → X11 ∧ X10 → X14, X2 ∧ X1 → X4 → X7 → X4 → X11 ∧ X10 → X14} are extracted from T4^+ and are poured into T4^cir at q = 4. After q iterations, circularity strands are poured into Tcir. Thus, strands {X1 ∧ X2 → X4 → X7 → X4 → X11 ∧ X10 → X14, X2 ∧ X1 → X4 → X7 → X4 → X11 ∧ X10 → X14} are removed from T.

When detecting conflicts (Step 3), nodes X4 and X9 are assumed to be the only conflicting pair, (Xi, Xj) = (X4, X9), in the sample (marked as concentric circles in Fig. 9). When executing procedure DetectConflict, strands denoting {R4 R5 R15} are poured into T1. Strands in T1 are amplified and poured into T1-1. Strands of {R3 R11 R15, R2 R9 R11 R15, R2 R10 R11 R15} are poured into T2. Strands in T2 are amplified and poured into T1-2.


Fig. 10. Redundancy detection.

To perform a redundancy check (Step 4), we find two redundant nodes $X_{10}$ and $X_8$ in $T_{r3}$ and then make two extra copies of $T$, for $X_{10}$ and $X_8$. Therefore, there are four tubes for $X_{10}$ and $X_8$, as illustrated in Fig. 10. Starting backward from node $X_{14}$, we extract the first redundant node ($X_{10}$) from $T_{10}$ and then extract the strands representing rule set $\{R_6R_{12}R_{15},\; R_2R_9R_{11}R_{15},\; R_3R_{12}R_{15},\; R_2R_{10}R_{11}R_{15},\; R_3R_{11}R_{15},\; R_2R_9R_{12}R_{15}\}$ to $T_{10}^{Run}$ and $\{R_4R_5R_{15}\}$ to $T_{10}^{Req}$. Because $X_{10}$ is the redundant node nearest the goal, strands in $T_{10}^{Req}$ are system required. Strands representing rule set $\{R_4R_5R_{15}\}$ are system required. Strands $\{R_6R_{12}R_{15},\; R_2R_{10}R_{11}R_{15}\}$ are required for the redundant rule set $\{R_3R_{11}R_{15},\; R_3R_{12}R_{15},\; R_2R_9R_{11}R_{15},\; R_2R_9R_{12}R_{15}\}$ when considering node $X_8$. Finally, to check whether the rules conflict with system requirements (Step 5), strand paths containing node $X_9$, namely, $\{R_2R_{10}R_{11}R_{15},\; R_3R_{11}R_{15},\; R_2R_9R_{11}R_{15}\}$, are removed from tube $T$ and from all required and redundant tubes.
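The effect of Step 5 on the example can be mimicked with ordinary set operations. The sketch below is a software analogy, not the tube procedure itself. It assumes that $T$ holds exactly the seven rule paths named in the example, that $\{R_4R_5R_{15}\}$ is the required path containing $X_4$, and that the three listed paths are the ones containing $X_9$. The function name `detect_con_sys` and its signature are hypothetical.

```python
def detect_con_sys(tube, required, with_xi, with_xj):
    """Drop paths containing X_j when some required path contains X_i.

    All arguments are sets of rule paths; a rule path is a tuple of rule names.
    """
    if required & with_xi:       # a required path depends on X_i ...
        return tube - with_xj    # ... so every path through X_j is removed
    return set(tube)

# Rule paths from the example (assumed membership, see the lead-in).
with_x4 = {("R4", "R5", "R15")}
with_x9 = {
    ("R3", "R11", "R15"),
    ("R2", "R9", "R11", "R15"),
    ("R2", "R10", "R11", "R15"),
}
others = {
    ("R6", "R12", "R15"),
    ("R3", "R12", "R15"),
    ("R2", "R9", "R12", "R15"),
}
T = with_x4 | with_x9 | others
required = {("R4", "R5", "R15")}

remaining = detect_con_sys(T, required, with_x4, with_x9)
```

Under these assumptions, `remaining` keeps the required path and the three paths that avoid $X_9$. In the molecular setting the same removal is applied to $T$ and to every required and redundant tube in parallel.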

4.2 Complexity Analysis

To represent rules and detect the structural errors mentioned in Section 2, recent studies of error-detecting techniques have focused on graphs [2], [4], [5] and Petri nets [1], [6], [7], unlike early studies that focused on pairwise rule checking [18]. Nguyen et al. [18] discussed potential knowledge base problems and described CHECK, a program for knowledge base completeness and consistency verification using the Lockheed Expert System (LES). Moreover, Nguyen et al. mentioned that their program detects problems only if the rule syntax is sufficiently restrictive to allow the examination of two rules. The program determines whether situations exist in which both rules can succeed and whether the results of applying the two rules are identical, conflicting, or unrelated. Nazareth [1] noted that pairwise rule comparison is inefficient for large systems with chained errors.

The approach proposed by Ramaswamy et al. [5], which uses directed hypergraphs to verify the rule-based system, is based on the adjacency matrix with column and row revisions. This adjacency matrix technique has a time complexity of $O(qm^3 + 4qm^2r^2)$, where m is the number of rules in the rule base, r is the maximum number of conjuncts in any rule, and q is the number of rules with the longest inference chain.

He et al. [7] applied $\omega$-nets, a special class of low-level Petri nets, to detect the structural errors in a rule-based system. When applying Petri nets to model a rule base, transitions are used to represent rules, while the input and output places of a transition are used to represent the conditions and conclusions of the rule, respectively. An enabled transition means that the rule conditions are satisfied and the associated rule is activated.
An algorithm for generating a reachability graph with a worst-case computational complexity of $O(\frac{1}{2}(n \times m^2))$ is shown in [7], where m denotes the number of inference rules, and n is the total number of unique places in the rule base. Several theorems in [7] were used to demonstrate that generating graphs can detect the mentioned structural errors. In their approach, the rule corresponding to a transition $t_i$ that does not result in a new marking of the $\omega$-net is treated as either redundancy or circularity; the other rules leading to $t_i$ are not considered part of the redundancy. The algorithms used in this approach thus did not strongly emphasize redundant set selection. Alternatively, all rule paths leading to the same marking can be considered redundant sets, rather than considering only the rule corresponding to $t_i$. Our approach thus collects the redundant rule sets and analyzes the required rule path(s) for each redundant node. Additionally, this study determines whether the redundant sets are required by the system. If they are required, the relative paths then become required (not redundant), even if the firing of the rule(s) does not generate a new marking when considering $\omega$-nets. The rules comprising the strand paths in $T^{Req}$ are required by the system. Subsequently, depending on the redundant node, the redundant rules or those required for the redundant node i are stored in $T_i^{Run}$ and $T_i^{Req}$, respectively.

The complexity of a given DNA computation algorithm is usually determined by its biological steps. To analyze complexity, we use the strong model of parallel filtering of DNA computation [10], which can realistically measure the complexity by taking into consideration that the time of a basic operation depends on the actual problem size instead of being constant [17]. The operation times for some of the fundamental operations defined in Section 3 can be found in [10] and are summarized as follows: the Union$(T, T_1, T_2, \ldots, T_n)$ and Copy$(T, T_1, T_2, \ldots, T_n)$ operations take $O(n)$ time. The VeriRuleDNA algorithm comprises six principal steps:

1. Init$(T, X_1, X_n)$,
2. DetectCir$(T, T_{r3}, X_i)$,
3. DetectConflict$(T, X_i, X_j, pr)$,
4. DetectRedundancy$(T, T_{r3}, X_i)$,
5. DetectConSys$(T, T^{Req}, X_i, X_j)$, and
6. Detect_yes$(T)$.

We assume that the initial tube $T$ (which takes the most linear time to build) is already constructed. Init$(T, X_1, X_n)$, consisting of two Extract operations and one Remove operation, takes $O(n)$ time. DetectCir$(T, T_{r3}, X_i)$, consisting of $(3 \times q)$ Separate, q Detect, one Copy, one Remove, and $(q + 1)$ Union operations, takes $O(q \times n)$ time. DetectConflict$(T, X_i, X_j, pr)$, consisting of three Extract, two Copy, and one Union operation, takes $O(n)$ time.
DetectRedundancy$(T, T_{r3}, X_i)$, consisting of one Copy and two parallel Extract operations, takes $O(n)$ time. DetectConSys$(T, T^{Req}, X_i, X_j)$, consisting of one Copy, one parallel Union, four parallel Extract, and one parallel Detect operation, takes $O(n)$ time, and Detect_yes$(T)$, consisting of one Detect operation and one Read operation, takes $O(1)$ time. Based on the proposed algorithm, the number of biological operations of VeriRuleDNA is determined to be $O(23 + 5q)$ by using the strong parallel model. Therefore, the time complexity is $O(n \times q)$. Indeed, further analysis of the time complexity may be achieved by experimental means, which is a possible future work.

Corollary 1 (of Theorem 1). The verification of a rule-based system with four structural errors can be solved using $O(23 + 5q)$ biological operations and $O(n \times q)$ time complexity, where n is the number of nodes in the rule base, and q is the number of rules with the longest inference chain.

Proof. Refer to the above. $\square$
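The step from the per-procedure bounds to the overall bound can be written out explicitly. The following is only a restatement of the figures quoted in the analysis, under the assumptions that the six procedures run one after another (so their running times add) and that $q \ge 1$:

```latex
\begin{align*}
T(n, q) &= \underbrace{O(n)}_{\text{Init}}
         + \underbrace{O(q \times n)}_{\text{DetectCir}}
         + \underbrace{O(n)}_{\text{DetectConflict}}
         + \underbrace{O(n)}_{\text{DetectRedundancy}}
         + \underbrace{O(n)}_{\text{DetectConSys}}
         + \underbrace{O(1)}_{\text{Detect\_yes}} \\
        &= O(q \times n) + O(n) \\
        &= O(n \times q).
\end{align*}
```

The DetectCir term dominates because it is the only procedure whose operation count grows with the length q of the longest inference chain.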


5 CONCLUSION

The role of DNA computing technology in applications has increased significantly in the last few years. The proposed algorithm VeriRuleDNA is the first to demonstrate that verification of complicated rule-based systems can be done efficiently by using a DNA-based algorithm. The method described here transforms rules into strands, constructs a rule-path space, and then employs basic biological operations to build procedures for detecting structural errors among the rule sets. In this paper, the VeriRuleDNA algorithm is a scheme in which computation time increases linearly with the number of existing nodes n. Two major features of the proposed approach are summarized as follows:

1. The time complexity of the proposed algorithm has been established. As proven in Corollary 1, the time complexity is $O(n \times q)$. By applying our proposed algorithm to a complicated rule-base verification system, we have shown that existing computational problems of significant complexity may be solved efficiently by using DNA computation.

2. The technique may be applied to other complex applications. Since the completeness of the algorithm is confirmed, this framework suggests that DNA strands can be designed to represent rule information. Algorithmic approaches that utilize biological operations on DNA strands may also be extended to other fields of computing and data engineering such as knowledge representation and reasoning.

REFERENCES

[1] D.L. Nazareth, “Investigating the Applicability of Petri Nets for Rule-Based System Verification,” IEEE Trans. Knowledge and Data Eng., vol. 5, no. 3, pp. 402-415, June 1993.
[2] D.L. Nazareth and M.H. Kennedy, “Verification of Rule-Based Knowledge Using Directed Graphs,” Knowledge Acquisition, vol. 3, pp. 339-360, 1991.
[3] G.S. Gursaran, S. Kanungo, and A.K. Sinha, “Rule-Base Content Verification Using a Digraph-Based Modeling Approach,” Artificial Intelligence Eng., vol. 13, pp. 321-336, 1999.
[4] G. Valiente, “Verification of Knowledge Based Redundancy and Subsumption Using Graph Transformations,” Int’l J. Expert Systems, vol. 6, no. 3, pp. 341-355, 1993.
[5] M. Ramaswamy, S. Sarkar, and Y.S. Chen, “Using Directed Hypergraphs to Verify Rule-Based Expert Systems,” IEEE Trans. Knowledge and Data Eng., vol. 9, no. 2, pp. 221-236, Mar.-Apr. 1997.
[6] S.J.H. Yang, J.P. Tsai, and C.C. Chen, “Fuzzy Rule Base Systems Verification Using High-Level Petri Nets,” IEEE Trans. Knowledge and Data Eng., vol. 15, no. 2, pp. 457-473, Mar./Apr. 2003.
[7] X. He, W.C. Chu, and H. Yang, “A New Approach to Verify Rule-Based Systems Using Petri Nets,” Information and Software Technology, vol. 45, pp. 663-669, 2003.
[8] L.M. Adleman, “Molecular Computation of Solutions to Combinatorial Problems,” Science, vol. 266, pp. 1021-1024, 1994.
[9] W.-L. Chang and M. Guo, “Molecular Solutions for the Subset-Sum Problem on DNA-Based Supercomputing,” BioSystems, vol. 73, pp. 117-130, 2004.
[10] M. Amos, Theoretical and Experimental DNA Computation. Springer, 2004.
[11] R.S. Braich, C. Johnson, P.W.K. Rothemund, D. Hwang, N. Chelyapov, and L.M. Adleman, “Solution of a Satisfiability Problem on a Gel-Based DNA Computer,” Proc. Sixth Int’l Workshop DNA-Based Computers: DNA Computing, 1999.
[12] G. Paun, G. Rozenberg, and A. Salomaa, DNA Computing: New Computing Paradigms. Springer-Verlag, 1998.
[13] R.P. Feynman and D. Gilbert, “There’s Plenty of Room at the Bottom,” Eng. and Science Magazine, vol. 23, no. 5, 1960.
[14] Q. Liu, Z. Guo, A.E. Condon, R.M. Corn, M.G. Lagally, and L.M. Smith, “A Surface-Based Approach to DNA Computation,” Proc. Second Ann. DIMACS Workshop DNA-Based Computers, 1996.
[15] B. Fu, R. Beigel, and F.X. Zhou, “An $\tilde{O}(2^n)$ Volume Molecular Algorithm for Hamiltonian Path,” Biosystems, vol. 52, pp. 217-226, 1999.
[16] S. Roweis, E. Winfree, R. Burgoyne, N.V. Chelyapov, M.F. Goodman, P.W.K. Rothemund, and L.M. Adleman, “A Sticker-Based Model for DNA Computing,” J. Computational Biology, vol. 5, no. 4, pp. 615-629, 1998.
[17] R.J. Lipton, “DNA Solution of Hard Computational Problems,” Science, vol. 268, pp. 542-545, 1995.
[18] T.A. Nguyen, W.A. Perkins, T.J. Laffey, and D. Pecora, “Knowledge Base Verification,” AI Magazine, vol. 8, no. 2, pp. 69-75, 1987.

Chung-Wei Yeh received the MS degree in computer science from the State University of New York, Binghamton. She is currently working toward the PhD degree in the Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan. Her research interests include DNA computing, database systems, rule-based systems, parallel processing, and software engineering.

Chih-Ping Chu received the PhD degree in computer science from Louisiana State University. He is currently a professor in the Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan. His research interests include parallel computing, Internet computing, DNA computing, and software engineering.
