Assignment Commands with Array References

PETER J. DOWNEY
The Pennsylvania State University, University Park, Pennsylvania

AND

RAVI SETHI
Bell Laboratories, Murray Hill, New Jersey

ABSTRACT. Straight line programs with assignment statements involving both simple and array variables are considered. Two such programs are equivalent if they compute the same values as a function of the inputs. Testing the equivalence of array programs is shown to be NP-hard. If array variables are updated but never subsequently referenced, equivalence can be tested in polynomial time. Programs without array variables can be tested for equivalence in expected linear time.

KEY WORDS AND PHRASES: semantics, array assignments, data structures, NP-complete

CR CATEGORIES: 5.24, 5.25
1. Introduction

An array reference is either the selection of information out of an array, as in t:=a[j], or the updating of an array, as in a[t]:=j. Complicated expressions and assignments can be built up using sequences of array selections, array updates, and operations of the form a:=b θ c, where θ is some operator. We will be interested in efficient algorithms for testing whether two programs composed of assignment commands compute the same values.

A variety of intriguing examples indicate why efficient equivalence algorithms are hard to find. In the following programs, c and d are assigned the same value:

    a[t]:=3; a[j]:=2; c:=a[t];
    b[t]:=3; b[j]:=2; d:=b[t];

Both c and d are assigned the value of if t=j then 2 else 3. Nested conditional expressions are encountered if c or d is subsequently used, as in: a[t]:=2; a[j]:=c; e:=a[t]. Is it obvious that e is assigned 2? From our earlier examples we know that c is assigned if t=j then 2 else 3, while
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
A preliminary version of this paper was presented at the 17th Annual Symposium on Foundations of Computer Science, Houston, Texas, October 1976.
The work of P. J. Downey was partially supported by NSF grant MCS75-22557.
Authors' addresses: P. J. Downey, Department of Computer Science, The University of Arizona, Tucson, AZ 85721; R. Sethi, Bell Laboratories, Murray Hill, NJ 07974.
The authors provided camera-ready copy for this paper.
© 1978 ACM 0004-5411/78/1000-0652 $00.75
Journal of the Association for Computing Machinery, Vol. 25, No. 4, October 1978, pp. 652-666.
e gets if t=j then c else 2. Putting these expressions together, e is assigned if t=j then 2 else 2, which simplifies to 2.

In the above programs, the effect of sequences of select and update commands is described using equality conditionals, which are conditional expressions of the form if A=B then C else D in which the only predicate that can occur is a test for equality. Such equality conditionals have long been used to describe array assignments. The use of equality conditionals is implicit in McCarthy's [21] and Kaplan's [16] program semantics using state vectors. Burstall's [5] semantics of list assignment uses equality conditionals to update functions representing a list. Axioms for array assignments using such conditionals are given in [12, 14, 10, 19]. See also [23], where axioms for assignments to a class of directed graphs are justified with respect to an interpretive model.

The interesting twist (Theorem 3.2) is that equality conditionals can be simulated by select and update commands. We show in Section 3 that the equivalence problem for programs with equality conditional assignments is NP-hard. It follows immediately that equivalence of programs with array select and update commands is NP-hard as well (even if the programs involve just one array with two elements!). The reader is referred to [1] for a discussion of NP-hard problems.

The study of equivalence of programs with arrays is motivated by the need for expression simplification algorithms in the areas of program optimization, automated program verification, and program testing by symbolic execution. For example, the problem arises in the interactive systems for debugging and testing programs described by Boyer et al. [4] and King [17]. These systems symbolically execute the program being tested, maintaining formal expressions instead of values in the locations of the program.
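The symbolic treatment of the introductory example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuple encoding of terms and the helper names `update`, `select`, and `evaluate` are hypothetical and do not come from the systems cited above. The sketch keeps a formal expression for each array, turns every selection into nested equality conditionals, and then checks by exhaustive evaluation over a small index range that e always receives 2.

```python
from itertools import product

def update(array, index, value):
    """Array term after `array[index] := value`; earlier terms are left intact."""
    return ("upd", array, index, value)

def select(array, index):
    """Symbolic value of `array[index]` as nested equality conditionals."""
    if isinstance(array, str):            # an input array that was never updated
        return ("sel", array, index)
    _, older, idx, val = array
    if idx == index:                      # same index name: the update is read back
        return val
    # Different names may still hold equal values, so a conditional is needed.
    return ("if", index, idx, val, select(older, index))

def evaluate(term, env):
    """Evaluate a symbolic term under concrete values for the names in env."""
    if isinstance(term, int):
        return term
    if isinstance(term, str):
        return env[term]
    if term[0] == "if":
        _, x, y, then, other = term
        equal = evaluate(x, env) == evaluate(y, env)
        return evaluate(then if equal else other, env)
    _, name, index = term                 # ("sel", name, index)
    return env[name][evaluate(index, env)]

a1 = update(update("a", "t", 3), "j", 2)   # a[t]:=3; a[j]:=2
c = select(a1, "t")                        # c:=a[t]
a2 = update(update(a1, "t", 2), "j", c)    # a[t]:=2; a[j]:=c
e = select(a2, "t")                        # e:=a[t]

print(c)  # ('if', 't', 'j', 2, 3)
print(e)  # ('if', 't', 'j', ('if', 't', 'j', 2, 3), 2)

# Brute-force check over a small index range (an assumption of this sketch).
always_two = all(evaluate(e, {"t": t, "j": j}) == 2
                 for t, j in product(range(3), repeat=2))
print(always_two)  # True
```

The exhaustive check stands in for the simplification step; it confirms the value on the sampled indices but does not simplify the expression itself.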
In the course of symbolic execution, the need arises to simplify expressions as far as possible, and to test for equality between expressions. We will be interested here in the complexity of testing equality between expressions involving array references.

2. The Model
2.1 SYNTAX. The syntax of programs in this paper is the same as that in Aho and Ullman [3] and Hoffmann and Landweber [13]. Let S and A be countable sets of simple and array names, respectively.¹ As a general rule A, B, ... denote simple names and α, β, ... denote array names. Informally, associated with a simple name will be a value from some value set V. Array names will be functions from V to V. A special operator (dot) will be used to select particular elements of an array, as in α.A, which may be thought of as the value stored at the element of α pointed to by the current value of A. For "operate" commands, we will use a finite set O of operator symbols. Associated with each operator θ in O is an integer r ≥ 1, called the rank of θ.

A command is a string of symbols having one of the following three forms:

    1. operate   A ← θ B_1 B_2 ... B_r,
    2. select    A ← α.B,
    3. update    α.A ← B,

where A, B, B_1, ..., B_r are elements of S, α ∈ A, θ ∈ O, and r is the rank of θ. Expressions are built up from O and S in the usual way.

A program π is a triple (P, I, U), where P is a finite sequence of commands, and I and U are finite subsets of S ∪ A. I and U may be thought of as input and output names. We defer precise definitions of "value" and "equivalence" to Section 2.3 on semantics. Intuitively, two programs are equivalent if they map inputs to outputs in the same way under all interpretations of the operators.

¹ We use the term "array" rather than the phrase "structured variable," since we will use the property of array references that a[t] and a[j] either refer to the same element, or they refer to distinct elements. This property does not hold for references to certain data structures considered by Burstall [5] and Park [24].

2.2 THE PROBLEM. Section 3 shows that the problem of testing the equivalence of programs containing select and update commands alone is NP-hard. Therefore, in order to explore the boundary between NP-hard and polynomial equivalence problems, we examine a syntactically more restrictive class of programs. Sections 4 and 5 study the equivalence of programs with update and operate commands, i.e. those involving no selections from an array. Even for such special programs, proving equivalence raises interesting issues, which are illustrated by the following examples.

Aho and Ullman [3] note that the following two programs are equivalent:

    α.A ← B          α.B ← A
    α.B ← A          α.A ← B
The reason is that if A ≠ B, then the two commands can be interchanged, since α.A and α.B must refer to different locations. If A = B, then they become identical commands.

When operate commands are also included, other interesting phenomena occur. Suppose that for some inputs and some operator θ, A is equivalent to θAB. An example of such an equivalence is a = b÷a, when a=2 and b=4. If A is equivalent to θAB, then it further follows that A is equivalent to θθABB, and so on. Extending our notation momentarily, the following two programs are equivalent under all interpretations:

    α.A ← ψCθθABB        α.θAB ← ψCA
    α.θAB ← ψCA          α.A ← ψCθθABB
For if A is not equivalent to θAB then the two commands can be permuted. If A is equivalent to θAB then the right-hand sides are forced to become equivalent.

The above examples show that to prove equivalence of even simple array programs, we must develop methods for propagating assumed equalities through expressions. In the foregoing example, the assumed equality A = θAB implies the equality ψCA = ψCθθABB. It turns out that algorithms for propagating such equalities are closely related to existing algorithms for what has been called "the common subexpression problem." Cocke and Schwartz [6] give an algorithm for detecting identical subexpressions which we review in Section 5. The running time of this algorithm, Ψ(n), is linear on the average, but is O(n²) in the worst case, where n is the number of commands in the program. Using this algorithm as a subroutine, we will give an O(n² Ψ(n)) algorithm for the equivalence of programs with operate and update commands.

2.3 REPRESENTATION AND SEMANTICS. Now that we know what array programs look like (syntax), we must next define what they mean (semantics). In addition, since we are investigating algorithms, we must deal with issues of efficient representation for array programs.

In an expression like (a+b)/(c-d), the order in which a+b and c-d are computed is not particularly important. Working just with operate commands, Aho and Ullman [2] found it convenient to use a graphical representation of programs that is insensitive to the order in which subexpressions are evaluated. Moreover, in a graph, it is easy to keep track of the "current" value of a name by constructing a separate node for each distinct value. The correspondence between programs with operate commands and their dags (directed acyclic graphs) has been studied by Aho
[FIG. 2.1. The dag D(π) for a program π, where π = (P, I, U), I = {A, B, C, α}, U = {E, α}, and P is the sequence D ← θAB; α.D ← C; α.A ← B; E ← α.D.]
and Ullman [2] and Culik [9]. The introduction of select and update commands does change the correspondence somewhat, but the flavor is the same. Figure 2.1 illustrates the dag representation we will use in this paper. The symbols | and ← at the nonleaf nodes in Figure 2.1 can be viewed as special operators corresponding to select and update commands, respectively. Following McCarthy [21], a useful way to view an update command is to imagine that a new array value α is computed, based on the old values of α, A, and B. Nodes labeled ← therefore have three sons.² Nodes corresponding to output values are called output nodes, and are represented by open circles. It is easy to devise a linear time algorithm for constructing the dag D(π) from a program π [2].

In order to specify the semantics of dags we will introduce the notions of data and the interpretation of operator symbols in the usual manner (see for example the treatment of flowchart schemes in Manna [20]). The point of departure is that the interpretation of | and ←, the select and update symbols, is fully specified. We separate data from interpretations since we wish to consider select and update commands independently of operate commands.

Let V be a set of values, and let [V^r → V] represent the set of all functions from V^r to V, for all r, r ≥ 1. An interpretation I assigns a function from [V^r → V] to each operator symbol θ from O of rank r. Data d assigns an element of V to each simple name symbol in I, and an element of [V → V] to each array name symbol in the input name set I. Recall that we view each array name as a function that maps "locations" to values. I and d will be said to have base domain V. The semantics of | and ← will be specified independently of the particular interpretation and data.

Let π = (P, I, U) be a program.
A nonleaf node u in D(π) is called a select, update, or operate node if u has label |, ←, or θ, for some operator symbol θ, respectively. The value V(u) of a node u in D(π) under (I,d), sometimes written Vu, is given by:

    1. V(u) = d(X) if u is a leaf with label X. If X is a simple name, the value of u will be an element of V; otherwise the value will be a function from V to V.
    2. V(u) = I(θ)(Vw_1, Vw_2, ..., Vw_r) if u has label θ with sons w_1, w_2, ..., w_r. Here the value of u is the result of applying the interpreted function represented by the operator θ to the values of the sons of u.
    3. V(u) = (Vw_1)(Vw_2) if u is a select node with sons w_1 and w_2. The value of the select node is found by applying the function represented by the first son to the value represented by the second son.
    4. V(u) = λx. if x = Vw_2 then Vw_3 else (Vw_1)(x), if w_1, w_2, and w_3 are the sons of the update node u. Thus V(u) is the function which agrees with Vw_1 on all arguments except Vw_2, where it takes on value Vw_3.

The value under (I,d) of D(π) is the set consisting of the values of the output nodes of D(π) under (I,d). Nodes u and w in a dag D(π) are equivalent under (I,d) if V(u) = V(w) under (I,d). Programs π and π' are equivalent under (I,d) if D(π) and D(π') have the same value under (I,d). Two nodes or programs are strongly equivalent (written ≡) if they are equivalent under (I,d) for all (I,d).³ Unless otherwise stated, we will work with singly rooted dags, in which the root is the only output node. Thus the value of a dag will be the value of its root.

² Luckham and Suzuki [19] express array assignment, assignment to dereferenced pointers, and assignment to Pascal record structures using these select and update operators. The "contents" and "assignment" operations of McCarthy [21] studied by Kaplan [16] are restricted forms of select and update. Kaplan assumes that in α.A and α.B, A and B can never have the same value since index variables are not assigned to.

3. Strong Equivalence

The complexity of the equivalence problem for programs depends on the kinds of commands permitted. If all commands are permitted, then dag D(π) contains operators from O ∪ {|, ←} and names from S ∪ A. We use the notation "programs over (O,|,←,S,A)" to denote the fact that operate, select, and update commands are allowed. Similar shorthand is used for other types of programs; e.g.
(|,←,S,A) denotes programs having select and update commands, but no operate commands. The complexity of an algorithm working on a dag D will be expressed in terms of the size, |D|, of the dag, which is the sum of the number of nodes and edges of D.

The first results examine the complexity of the equivalence problem for programs over ({⊜},S), where ⊜(A,B,C,D) is if A=B then C else D.

THEOREM 3.1. Let P and Q be programs over ({⊜},S). Let interpretation I map ⊜ to λuvwx. if u=v then w else x. Determining if P and Q are inequivalent under (I,d), for all d, is an NP-complete problem. Moreover the problem is NP-complete even if the base domain of I and d is {0,1}.

PROOF. Let 3-SAT be the problem of determining if a Boolean formula in conjunctive normal form with three literals per clause is satisfiable. We will reduce 3-SAT to the inequivalence problem for programs. We give an example reduction, leaving the rest to the reader. Consider the Boolean formula (y_1 + ȳ_2 + y_3)·(ȳ_1 + y_2 + y_3). We will construct programs P and Q that are not equivalent if and only if the formula is satisfiable. Program P is as follows:

comment: A_i and B_i correspond to y_i and ȳ_i, respectively. T will be used to test the truth or falsity of literals represented by A_i or B_i. If at least one of the literals in clause j is true, then C_j will be assigned Y; otherwise C_j will be assigned N. D will ensure that all clauses are true.
³ Different notions of equivalence appear in Aho and Ullman [3] and Hoffmann and Landweber [13]. Two programs equivalent in either of these senses are strongly equivalent, but not conversely.
    I = {T, Y, N, 0, 1, D, A_1, B_1, A_2, B_2, A_3, B_3, C_1, C_2}
    C_1 ← if A_1=T then Y else N
    C_1 ← if B_2=T then Y else C_1
    C_1 ← if A_3=T then Y else C_1
    C_2 ← if B_1=T then Y else N
    C_2 ← if A_2=T then Y else C_2
    C_2 ← if A_3=T then Y else C_2
    D ← if C_1=N then 0 else 1
    D ← if C_2=N then 0 else D
    comment: The remaining commands ensure that A_i ≠ B_i.
    D ← if A_1=B_1 then 0 else D
    D ← if A_2=B_2 then 0 else D
    D ← if A_3=B_3 then 0 else D
    U = {D}.
Program Q is given by I = {0}, U = {0} and has no commands. For the two programs to have different values under (I,d), it must be true that the output name D in program P has the same value as 1. For D to have the value of 1, it must be true that A_i ≠ B_i and C_j ≠ N. Moreover, the data function d must assign distinct values to Y and N as well as 0 and 1. Thus P and Q are inequivalent if and only if the Boolean formula is satisfiable.

Consider the base domain V of I and d. Since all we need are distinct values for Y and N, 0 and 1, A_i and B_i, the result holds even for V = {0,1}.
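The reduction can be sanity-checked by brute force on small formulas. The following Python sketch is illustrative only: the clause encoding (pairs of a 0-based variable index and a negation flag) and the function names `run_P`, `inequivalent`, and `satisfiable` are assumptions of the sketch, not notation from the proof. It executes program P on every data assignment over {0,1}, compares the output D with the value of the name 0 (the output of Q), and compares the outcome with direct satisfiability testing.

```python
from itertools import product

def cond(a, b, c, d):
    """Equality conditional: if a = b then c else d."""
    return c if a == b else d

def run_P(clauses, n_vars, T, Y, N, zero, one, A, B):
    """Execute program P of the reduction and return the final value of D."""
    C = []
    for clause in clauses:
        c = N                                   # C_j starts out as N
        for var, negated in clause:
            literal = B[var] if negated else A[var]
            c = cond(literal, T, Y, c)          # C_j <- if literal=T then Y else C_j
        C.append(c)
    D = one
    for c in C:
        D = cond(c, N, zero, D)                 # D <- if C_j=N then 0 else D
    for i in range(n_vars):
        D = cond(A[i], B[i], zero, D)           # enforce A_i != B_i
    return D

def inequivalent(clauses, n_vars):
    """True if some data over {0,1} makes P's output differ from Q's output."""
    for T, Y, N, zero, one in product((0, 1), repeat=5):
        for bits in product((0, 1), repeat=2 * n_vars):
            A, B = bits[:n_vars], bits[n_vars:]
            if run_P(clauses, n_vars, T, Y, N, zero, one, A, B) != zero:
                return True
    return False

def satisfiable(clauses, n_vars):
    """Direct truth-table test of the CNF formula."""
    for ys in product((False, True), repeat=n_vars):
        if all(any(ys[v] != neg for v, neg in cl) for cl in clauses):
            return True
    return False

# (y1 + ~y2 + y3) . (~y1 + y2 + y3), the formula used in the proof
formula = [[(0, False), (1, True), (2, False)],
           [(0, True), (1, False), (2, False)]]
print(inequivalent(formula, 3), satisfiable(formula, 3))  # True True
```

Both searches are exponential in the number of variables; the sketch only confirms the correspondence on examples and says nothing about the complexity bound itself.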
Finally, we observe that the inequivalence problem is in NP by constructing a nondeterministic machine M. Given programs P and Q, with a finite set {A, B, ...} of input names between them, M guesses which name pairs are equal in value; all other pairs are assumed unequal. M then verifies that this set of equalities and inequalities is consistent, and using these relations, executes P and Q to see if they have unequal values. The process takes polynomial time. □

COROLLARY 3.1. The equivalence problem for programs over ({⊜},S) is NP-hard.

PROOF. Equivalence and inequivalence are complementary problems. Any problem is polynomial Turing reducible to its complement [18]. Since the inequivalence problem is NP-complete and reduces in polynomial time to its complement, the equivalence problem, it follows that the equivalence problem must be NP-hard. □

It is unlikely that the equivalence problem is in NP, since NP-completeness of two complementary problems implies that the sets NP and coNP = {S̄ | S ∈ NP} coincide [18]. It is widely conjectured that NP ≠ coNP.

Constable et al. [8] give a result similar to Theorem 3.1. They consider loop free programs with conditional branching controlled by arbitrary predicate symbols, allowing assignments to simple variables. For such programs, they show that the inequivalence problem is NP-hard. In Theorem 3.1, the only predicate is equality and there is no flow of control, since the result concerns sequences of assignments. In other words, the Constable et al. result concerns conditional statements, whereas Theorem 3.1 concerns conditional expressions.

From the examples in Section 1 and the definition of value for update nodes in Section 2.3, the reader may have noted that it is possible for sequences of select and update commands to simulate the equality conditional operator ⊜. The interesting point is that all we need is one two-element array α to carry out the simulation.
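The simulation can be illustrated with a minimal Python sketch. It uses the three-command sequence that appears in the proof of Theorem 3.2 (update α.A with D, update α.B with C, then select α.A); the function names `cond` and `simulate` and the use of a two-element list for α are assumptions of the sketch. The check runs over every choice of A, B, C, D and every initial array content in the base domain {0,1}.

```python
from itertools import product

def cond(a, b, c, d):
    """Equality conditional: if a = b then c else d."""
    return c if a == b else d

def simulate(a, b, c, d, alpha):
    """Run alpha.A <- D; alpha.B <- C; E <- alpha.A and return E."""
    alpha = list(alpha)     # work on a copy of the two-element array
    alpha[a] = d            # update alpha.A <- D
    alpha[b] = c            # update alpha.B <- C (overwrites D when A = B)
    return alpha[a]         # select E <- alpha.A

agree = all(
    simulate(a, b, c, d, alpha) == cond(a, b, c, d)
    for a, b, c, d in product((0, 1), repeat=4)
    for alpha in product((0, 1), repeat=2)
)
print(agree)  # True
```

The result does not depend on the initial contents of α, since the element that is selected has always just been written.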
THEOREM 3.2. Let P and Q be programs over (|,←,S,{α}). Determining if P and Q are inequivalent under (I,d) for all I and d is an NP-complete problem. Moreover the result is true even if the base domain of I and d is {0,1}.

PROOF. In order to prove this theorem we will show that a sequence of select and update commands using one array α can simulate the reduction in the proof of Theorem 3.1. The command E ← ⊜(A,B,C,D) can be simulated by the following sequence:

    α.A ← D; α.B ← C; E ← α.A.

Each command in the proof of Theorem 3.1 can be replaced by a sequence of three commands as outlined above. Thus we can reduce satisfiability of Boolean expressions to inequivalence of programs over (|,←,S,A). The inequivalence problem can be shown to be in NP either by showing that any sequence of select and update commands can be simulated by a sequence of operate commands over the equality conditional operator, or directly by constructing a machine M much as in the proof of Theorem 3.1. □

COROLLARY 3.2. The equivalence problem for programs over (|,←,S,{α}) is NP-hard.

PROOF. As in Corollary 3.1. □

Let us review the basic reduction used in the proofs above. Note that (i) no "arithmetic" on the indices or values of the arrays was used: the original input values are simply moved about in the array α and in the index variables, and (ii) no "indirect addressing" through the array was used. But what if we restrict the index values to be of one mode (say integer) and the array values to be of another (say real), as would happen in numerical analysis programs? (Our attention was drawn to mode conflicts by van Leeuwen [26].) We argue that such a separation of modes makes no difference: equivalence is still intractable. In the reduction of Theorem 3.1, A_i, B_i, and T must be of the same mode since they participate in equality tests.
Similarly, C_j, Y, and N must be of the same mode, and D, 0, and 1 must be of the same mode. An examination of the reduction shows that if each name had an associated mode, no mode conflicts would occur.⁴

With the negative results out of the way, we turn now to algorithms for determining equivalence of programs.

4. Characterization of Equivalence

Operate commands force us to confront the substitution property of equality illustrated by the following implication: If a = b÷a, then a = b÷(b÷a) = b÷(b÷(b÷a)), and so on. The next example shows why this property arises in program equivalence.

Example 4.1. Consider the following two equivalent programs, where I = {A, B} and U = {α}.

    π_1:  α.A ← θAB;  α.θAB ← θθABB;  α.B ← θBA
    π_2:  α.θAB ← θθABB;  α.B ← θBA;  α.A ← θAB
Under a given interpretation (I,d), either A and θAB have the same value, or A and θAB have different values. In the first case, it follows that θθABB has the same value as θAB, which of course has the same value as A. In this case, the first two instructions in π_1 are both equivalent to α.A ← A, so they can be exchanged. In the second case, A and θAB have different values, so interchanging the two commands does not affect the function which is the final value of α. Thus π_1 ≡ π, where π is α.θAB ← θθABB; α.A ← θAB; α.B ← θBA. Finally, we argue similarly by cases that the last two instructions in π can be exchanged in any interpretation, yielding π_2. Thus π_1 ≡ π_2. □

⁴ Another relevant observation is that the dag of the program in the proof of Theorem 3.1 is a tree. A simpler reduction, in terms of the number of distinct modes required, is possible if the underlying graph is not restricted to being a tree. The structure of the dag used in the reduction becomes important when the boundary between polynomial and NP-hard equivalence problems is explored [25].

By formalizing the type of argument suggested by Example 4.1, we will find that, as long as programs contain no select commands, equivalence can be tested in polynomial time. The difference between programs π_1 and π_2 in Example 4.1 lies in the order in which particular updates of the array α take place. The extent to which updates can be reordered will be expressed by a logical formula, which can form the basis of an equivalence algorithm.

Let C be a singly rooted dag over (O,←,S,A), that is, a dag without select nodes. Since an operate node cannot be the father of an update node, and there are no select nodes, the update nodes in C must form a "chain." Moreover, if there are any update nodes in C, then the root of C must be an update node. Henceforth we assume that C does indeed have update nodes.
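Before the formal development, the claim of Example 4.1 can be checked mechanically on a small domain. The Python sketch below is a sanity check, not a proof and not the polynomial algorithm of this paper: it enumerates every interpretation of a binary operator θ over a three-element base domain (a size chosen only to keep the search small) and every pair of input values, and compares the updates each program makes to α. The term encoding and the names `ev`, `run`, and `equivalent` are hypothetical.

```python
from itertools import product

V = (0, 1, 2)  # small base domain; an assumption of this sketch

def ev(term, theta, env):
    """Evaluate a simple name or a term ("theta", x, y) under one interpretation."""
    if isinstance(term, str):
        return env[term]
    _, x, y = term
    return theta[(ev(x, theta, env), ev(y, theta, env))]

def run(program, theta, env):
    """Execute update commands; return the locations written and their final values."""
    alpha = {}
    for index, value in program:
        alpha[ev(index, theta, env)] = ev(value, theta, env)  # alpha.index <- value
    return alpha

def equivalent(p, q):
    """Compare p and q under every theta: V x V -> V and every input pair (A, B)."""
    pairs = list(product(V, repeat=2))
    for table in product(V, repeat=len(pairs)):
        theta = dict(zip(pairs, table))
        for a, b in pairs:
            env = {"A": a, "B": b}
            if run(p, theta, env) != run(q, theta, env):
                return False
    return True

tAB = ("theta", "A", "B")     # θAB
ttABB = ("theta", tAB, "B")   # θθABB
tBA = ("theta", "B", "A")     # θBA

pi1 = [("A", tAB), (tAB, ttABB), ("B", tBA)]
pi2 = [(tAB, ttABB), ("B", tBA), ("A", tAB)]

result = equivalent(pi1, pi2)
print(result)  # True
```

Because neither program selects from α, comparing the written locations suffices here: unwritten locations keep their initial values in both programs. Reordering is not always safe, however; for instance, α.A ← θAB; α.B ← A and its reversal differ whenever A = B and θAA ≠ A.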
[FIG. 4.1. A sketch of the relationship between dags C, D, E_i, G_i, F_j, H_j]
Definition 4.1. (Refer to Figure 4.1.) Let u_0 be the update node that is the root of C. For i ≥ 0, if u_i is not a leaf, then let the sons of u_i be u_{i+1}, v_{i+1}, and w_{i+1}. Since all dags are finite, for some n, u_n must be a leaf. Let the leaf u_n have label α. For all i, 1 ≤ ...

        ... on ACL
        }
        x := NX(D);
    end
end MK

FIG. 5.2. Value number algorithm
comment /*
    Input:  x_e and x_f are the roots of dags E, F with trees e, f over (O,S).
            F is assumed to have at least as many nodes as E.
            D is the dag to be marked (E and F may be subdags of D).
    Output: Array VN such that VN(x) = VN(y) if and only if x ~(e,f) y.
*/
procedure CMK(D, x_e, x_f)
    initialize VN to zero; COUNT := 0; ACL := null;
    call MK(E, ACL, COUNT, VN);
    call MK(F, ACL, COUNT, VN);
    alter the item on ACL for x_f to record a value number equal to VN(x_e);
    VN(x_f) := VN(x_e);
    /* The only two items on ACL with the same value are those corresponding to x_e and x_f */
    call MK(D, ACL, COUNT, VN)
end CMK

FIG. 5.3. Algorithm to detect congruent nodes
time to retrieve n items from the hash table, and n = |D|. Of course, Ψ(n) = O(n²) in the worst case, but typical hashing methods yield an expected value of O(n) for Ψ(n), as long as the hash table is not too full. By using this algorithm, equivalence of programs over (O,S) can be tested in O(Ψ(n)) time. The same algorithm serves to solve the equivalence problem for dags over (O,|,S,A) in O(Ψ(n)) time, for if arrays are never updated, they may be regarded as functions of one argument.

A variation of the value number algorithm may be used to determine all congruent nodes of a dag, subject to e=f. The algorithm is given in Figure 5.3.

LEMMA 5.2. At the termination of Algorithm CMK on dag D, VN(y) = VN(z) if and only if y ~(e,f) z.

PROOF. By induction on the sum of the heights of nodes y and z. The result is clearly true if nodes y and z are leaves. Assume for nodes y and z that the result is true of their sons. y and z are given the same value number by CMK iff either (i) OP(y) = OP(z) and the corresponding sons of y and z have identical value numbers, or (ii) OP(y) = OP(x_e), corresponding sons of y and x_e have identical value numbers, OP(z) = OP(x_f), and corresponding sons of z and x_f have identical value numbers. By the inductive hypothesis, these cases yield (i) OP(y) = OP(z) and the corresponding sons of y and z are congruent, or (ii) OP(y) = OP(x_e), corresponding sons of y and x_e are congruent, OP(z) = OP(x_f), and corresponding sons of z and x_f are congruent. This condition for the two cases is equivalent to y ~(e,f) z. □

Repeated applications of Algorithm CMK will suffice to test almost all the conditions K(i,j). Only the conditions K(i,j) with i = n+1 or j = m+1 involving α remain. The following lemma shows that these conditions can easily be dealt with.

LEMMA 5.3. Let dags C and D be as in Definition 4.2. Then the conditions K(n+1,j), 1 ≤ j ≤ m+1, and K(i,m+1), 1 ≤