Fast Arithmetic in Algorithmic Self-Assembly

Alexandra Keenan∗, Robert Schweller∗, Michael Sherman∗, Xingsi Zhong∗

arXiv:1303.2416v2 [cs.DS] 5 Aug 2013
Abstract

In this paper we consider the time complexity of adding two n-bit numbers together within the tile self-assembly model. The (abstract) tile assembly model is a mathematical model of self-assembly in which system components are square tiles with different glue types assigned to tile edges. Assembly is driven by the attachment of singleton tiles to a growing seed assembly when the net force of glue attraction for a tile exceeds some fixed threshold. Within this framework, we examine the time complexity of computing the sum or product of two n-bit numbers, where the input numbers are encoded in an initial seed assembly, and the output is encoded in the final, terminal assembly of the system. We show that the problems of addition and multiplication have worst case lower bounds of Ω(√n) in 2D assembly, and Ω(n^{1/3}) in 3D assembly. In the case of addition, we design algorithms for both 2D and 3D that meet this bound with worst case run times of O(√n) and O(n^{1/3}) respectively, which beats the previous best known upper bound of O(n). Further, we consider average case complexity of addition over uniformly distributed n-bit strings and show how to achieve O(log n) average case time with a simultaneous O(√n) worst case run time in 2D. For multiplication, we present an O(n^{5/6}) time multiplication algorithm which works in 3D, which beats the previous best known upper bound of O(n). As additional evidence for the speed of our algorithms, we implement our addition algorithms, along with the simpler O(n) time addition algorithm, into a probabilistic run-time simulator and compare the timing results.
∗ Department of Computer Science, University of Texas - Pan American, {abkeenan,rtschweller,mjsherman,zhongxingsi}@utpa.edu. This research was supported in part by National Science Foundation Grant CCF-1117672.
Table 1: Summary of results.

                                   Worst Case UB            Worst Case LB            Average Case
Addition (2D)                      Θ(√n) (Thm. 6.1)         Ω(√n) (Thm. B.4)         O(log n) (Thm. 6.1)
Addition (3D)                      Θ(n^{1/3}) (Thm. 7.1)    Ω(n^{1/3}) (Thm. B.4)    O(log n) (Thm. 7.1)
Multiplication (3D)                O(n^{5/6}) (Thm. 8.1)    Ω(n^{1/3}) (Thm. B.5)    -
Previous Best Addition (2D)        O(n) (see [5])           -                        -
Previous Best Multiplication (2D)  O(n) (see [5])           -                        -

1
Introduction.
Self-assembly is the process by which systems of simple objects autonomously organize themselves through local interactions into larger, more complex objects. Self-assembly processes are abundant in nature and serve as the basis for biological growth and replication. Understanding how to design and efficiently program molecular self-assembly systems promises to be fundamental for the future of nanotechnology. One particular direction of interest is the design of molecular computing systems for the efficient solution of fundamental computational problems. In this paper we study the complexity of computing arithmetic primitives within a well studied model of algorithmic self-assembly, the abstract tile assembly model.

The abstract tile assembly model (aTAM) models system monomers with four sided Wang tiles with glue types assigned to each edge. Assembly proceeds by tiles attaching, one by one, to a growing initial seed assembly whenever the net glue strength of attachment exceeds some fixed temperature threshold. The aTAM has been shown to be capable of universal computation [18], and research leveraging this computational power has led to efficient assembly of complex geometric shapes and patterns, with a number of recent results in FOCS, SODA, and ICALP [1, 6, 7, 9–15, 17]. This universality also allows the model to serve directly as a model for computation in which an input bit string is encoded into an initial assembly. The process of self-assembly and the final produced terminal assembly represent the computation of a function on the given input. Given this framework, it is natural to ask how fast a given function can be computed in this model. Tile assembly systems can be designed to take advantage of massive parallelism when multiple tiles attach at distinct positions in parallel, opening the possibility for faster algorithms than what can be achieved in more traditional computational models. On the other hand, tile assembly algorithms must use up geometric space to perform computation, and must pay substantial time costs when communicating information between two physically distant bits. This creates a host of challenges unique to this physically motivated computational model that warrant careful study.

In this paper we consider the time complexity of adding or multiplying two n-bit numbers within the abstract tile assembly model. We show that addition and multiplication have worst-case lower bounds of Ω(√n) time in 2D and Ω(n^{1/3}) time in 3D. These lower bounds are derived by a reduction from a simple problem we term the communication problem, in which two distant bits must compute the AND function between themselves. This general reduction technique can likely be applied to a number of problems and yields key insights into how one might design a sub-linear time solution to such problems. We in turn show that for the problem of addition these lower bounds are matched by corresponding worst case O(√n) and O(n^{1/3}) run time algorithms, respectively, which improves upon the previous best known result of O(n) [5]. We then consider the average case complexity of addition given two uniformly generated random n-bit numbers and construct an O(log n) average case time algorithm that achieves simultaneous worst case run time O(√n) in 2D. To the best of our knowledge this is the first tile assembly algorithm proposed for efficient average case adding.
Finally, we present a 3D algorithm that achieves O(n^{5/6}) time for multiplication, which beats the previous fastest algorithm of time O(n) [5] (which works in 2D). We analyze our algorithms under two established timing models described in Section 2.4.
(a) Incorrect binding. (b) Correct binding. (Both panels at temperature 2.)
Figure 1: Cooperative tile binding in the aTAM.

Our results are summarized in Table 1. In addition to our analytical results, tile self-assembly software simulations were conducted to visualize the diverse approaches to fast arithmetic presented in this paper, as well as to compare them to previous work. The adder tile constructions described in Sections 4, 5 and 6, and the previous best known algorithm from [5], were simulated using the two timing models described in Section 2.4. These results can be seen in the graphs in Section 9.
2
Definitions
2.1
Basic Notation.
Let N_n denote the set {1, . . . , n} and let Z_n denote the set {0, . . . , n − 1}. Consider two points p, q ∈ Z^d, p = (p_1, . . . , p_d), q = (q_1, . . . , q_d). Define ∆_{p,q} ≜ max_{1≤i≤d} |p_i − q_i|.
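As a quick illustrative aside (ours, not part of the paper's formalism), ∆_{p,q} is simply the Chebyshev (L∞) distance, which a few lines of Python make concrete:

```python
def delta(p, q):
    """Chebyshev (L-infinity) distance: max over coordinates of |p_i - q_i|."""
    return max(abs(pi - qi) for pi, qi in zip(p, q))

# Example: two points in Z^2 at Chebyshev distance 3.
assert delta((0, 0), (3, -2)) == 3
```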
2.2
Abstract Tile Assembly Model.
Tiles. Consider some alphabet of symbols Π called the glue types. A tile is a finite edge polygon (polyhedron in the case of a 3D generalization) with some finite subset of border points each assigned some glue type from Π. Further, each glue type g ∈ Π has some non-negative integer strength str(g). For each tile t we also associate a finite string label (typically "0", "1", or the empty label in this paper), denoted by label(t), which allows the classification of tiles by their labels. In this paper we consider a special class of tiles that are unit squares (or unit cubes in 3D) of the same orientation with at most one glue type per face, with each glue being placed exactly in the center of the tile's face. We denote the location of a tile to be the point at the center of the square or cube tile. In this paper we focus on tiles at integer locations.

Assemblies. An assembly is a finite set of tiles whose interiors do not overlap. Further, to simplify formalization in this paper, we require the center of each tile in an assembly to be an integer coordinate (or integer triplet in 3D). If each tile in A is a translation of some tile in a set of tiles T, we say that A is an assembly over tile set T. For a given assembly Υ, define the bond graph G_Υ to be the weighted graph in which each element of Υ is a vertex, and the weight of an edge between two tiles is the strength of the overlapping matching glue points between the two tiles. Note that only overlapping glues of the same type contribute a non-zero weight, whereas overlapping, non-equal glues always contribute zero weight to the bond graph. The property that only equal glue types interact with each other is referred to as the diagonal glue function property and is perhaps more feasible than more general glue functions for experimental implementation. An assembly Υ is said to be τ-stable for an integer τ if the min-cut of G_Υ is at least τ.

Tile Attachment. Given a tile t, an integer τ, and a τ-stable assembly A, we say that t may attach to A at temperature τ to form A′ if there exists a translation t′ of t such that A′ = A ∪ {t′}, and A′ is τ-stable. For a tile set T we use notation A →_{T,τ} A′ to denote that there exists some t ∈ T that may attach to A to form A′ at temperature τ. When T and τ are implied, we simply say A → A′. Further, we say that A →* A′ if there exists a finite sequence of assemblies ⟨A_1, . . . , A_k⟩ such that A → A_1 → · · · → A_k → A′.
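To make the attachment rule concrete, here is a minimal Python sketch (our illustration, not the authors' implementation) that checks whether a single square tile may attach to a 2D assembly at temperature τ. For a single-tile attachment, τ-stability of the result reduces to the attaching tile's summed matching-glue strength reaching τ:

```python
# Simplified 2D sketch of the attachment rule.  A tile maps sides to
# (glue_type, strength) pairs; an absent side means no glue there.
NEIGHBORS = {"N": ((0, 1), "S"), "E": ((1, 0), "W"),
             "S": ((0, -1), "N"), "W": ((-1, 0), "E")}

def can_attach(assembly, tile, pos, tau):
    """True if `tile` may attach to `assembly` at empty position `pos`
    at temperature `tau`: the summed strength of matching glues on
    occupied neighboring faces must reach tau.  Only equal glue types
    bind (the diagonal glue function property)."""
    if pos in assembly:
        return False
    strength = 0
    for side, ((dx, dy), opposite) in NEIGHBORS.items():
        nbr = assembly.get((pos[0] + dx, pos[1] + dy))
        if nbr is None:
            continue
        g, h = tile.get(side), nbr.get(opposite)
        if g and h and g[0] == h[0]:
            strength += g[1]
    return strength >= tau

# Cooperative binding at tau = 2: two strength-1 matches are required.
asm = {(0, 1): {"E": ("x", 1)}, (1, 0): {"N": ("y", 1)}}
t = {"W": ("x", 1), "S": ("y", 1)}
assert can_attach(asm, t, (1, 1), tau=2)       # two cooperating glues
assert not can_attach(asm, t, (2, 2), tau=2)   # isolated position: no bond
```

The two assertions at the end mirror Figure 1: at temperature 2, a tile with two strength-1 glues binds only where both glues find matching neighbors.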
Tile Systems. A tile system Γ = (T, S, τ) is an ordered triplet consisting of a set of tiles T referred to as the system's tile set, a τ-stable assembly S referred to as the system's seed assembly, and a positive integer τ referred to as the system's temperature. A tile system Γ = (T, S, τ) has an associated set of producible assemblies, PROD_Γ, which define what assemblies can grow from the initial seed S by any sequence of temperature-τ tile attachments from T. Formally, S ∈ PROD_Γ as a base case producible assembly. Further, for any A ∈ PROD_Γ, if A →_{T,τ} A′, then A′ ∈ PROD_Γ. That is, assembly S is producible, and for any producible assembly A, if A can grow into A′, then A′ is also producible. We further define the set of terminal assemblies TERM_Γ to be the subset of PROD_Γ containing all producible assemblies that have no attachable tile from T at temperature τ. Conceptually, TERM_Γ represents the final collection of output assemblies that are built from Γ given enough time for all assemblies to reach a final, terminal state. General tile systems may have terminal assembly sets containing zero, finitely many, or infinitely many distinct assemblies. Systems with exactly one terminal assembly are said to be deterministic. For a deterministic tile system Γ, we say Γ uniquely assembles assembly A if TERM_Γ = {A}. In this paper, we focus exclusively on deterministic systems. For recent consideration of non-determinism in tile self-assembly see [6, 7, 9, 11, 15].
2.3
Problem Description.
We now formalize what we mean for a tile self-assembly system to compute a function. To do this we present the concept of a tile assembly computer (TAC), which consists of a tile set and temperature parameter, along with input and output templates. The input template serves as a seed structure with a sequence of wildcard positions at which tiles of label "0" and "1" may be placed to construct an initial seed assembly. An output template is a sequence of points denoting locations at which the TAC, when grown from a filled-in template, will place tiles with "0" and "1" labels that denote the output bit string. A TAC then is said to compute a function f if for any seed assembly derived by plugging in a bit string b, the terminal assembly of the system with tile set T and temperature τ is such that the value of f(b) is encoded in the sequence of tiles placed according to the locations of the output template.

We now develop the formal definition of the TAC concept. We note that the formality in the input template is of substantial importance. Simpler definitions which map seeds to input bit strings, and terminal assemblies to output bit strings, are problematic in that they allow for the possibility of encoding the computation of function f in the seed structure. Even something as innocuous sounding as allowing more than a single type of "0" or "1" tile as an input bit has the subtle issue of allowing pre-computing of f.¹

Input Template. Consider a tile set T containing exactly one tile t_0 with label "0", and one tile t_1 with label "1". An n-bit input template over tile set T is an ordered pair U = (R, B), where R is an assembly over T − {t_0, t_1}, B : N_n → Z², and B(i) is not the position of any tile in R for any i from 1 to n. The sequence of n coordinates denoted by B conceptually denotes "wildcard" tile positions at which copies of t_0 and t_1 will be filled in for any instance of the template. As notation, we define assembly U_b over T, for bit string b = b_1 . . . b_n, to be the assembly consisting of assembly R unioned with a set of n tiles t^i for i from 1 to n, where t^i is equal to a translation of tile t_{b_i} to position B(i). That is, U_b is the assembly R with each position B(i) tiled with either t_0 or t_1 according to the value of b_i.

Output Template. A k-bit output template is simply a sequence of k coordinates denoted by a function C : N_k → Z². For an output template V, an assembly A over T is said to represent binary string c = c_1 . . . c_k over template V if the tile at position C(i) in A has label c_i for all i from 1 to k. Note that output templates are much looser than input templates in that there may be multiple tiles with labels "1" and "0", and there are no restrictions on the assembly outside of the k specified wildcard positions. The strictness for the input template stems from the fact that the input must "look the same" in all ways except for the explicit input bit patterns. If this were not the case, it would likely be possible to encode the solution to the computational problem into the input template, resulting in a trivial solution.

¹This subtle issue seems to exist with some previous formulations of tile assembly computation.
Function Computing Problem. A tile assembly computer (TAC) is an ordered quadruple ℑ = (T, U, V, τ) where T is a tile set, U is an n-bit input template, V is a k-bit output template, and τ is the temperature. A TAC is said to compute function f : Z_2^n → Z_2^k if for any b ∈ Z_2^n and c ∈ Z_2^k such that f(b) = c, the tile system Γ_{ℑ,b} = (T, U_b, τ) uniquely assembles an assembly A which represents c over template V. For a TAC ℑ that computes the function f : Z_2^{2n} → Z_2^{n+1} where f(r_1 . . . r_{2n}) = r_1 . . . r_n + r_{n+1} . . . r_{2n}, we say that ℑ is an n-bit adder TAC with inputs a = r_1 . . . r_n and b = r_{n+1} . . . r_{2n}. An n-bit multiplier TAC is defined similarly.
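As an illustration of how the template formalism is used (names here are hypothetical, not from the paper), a seed assembly U_b is obtained by stamping the bits of b into the wildcard positions of the input template, and the output is read back off the output template's positions:

```python
def fill_template(frame, wildcard_positions, bits, t0="0", t1="1"):
    """Build the seed assembly U_b: the frame R plus a '0' or '1' tile
    at each wildcard position B(i), per the bit string b."""
    assert len(wildcard_positions) == len(bits)
    seed = dict(frame)                      # frame: position -> tile label
    for pos, bit in zip(wildcard_positions, bits):
        assert pos not in frame             # wildcards are empty in the frame
        seed[pos] = t1 if bit == "1" else t0
    return seed

def read_output(assembly, output_positions):
    """Decode the output bit string: read the labels at the output
    template's positions C(1..k) in the terminal assembly."""
    return "".join(assembly[pos] for pos in output_positions)

# Example: a 2-bit input template with wildcards at (0, 0) and (3, 0).
seed = fill_template({(1, 0): "S", (2, 0): "S"}, [(0, 0), (3, 0)], "10")
assert seed[(0, 0)] == "1" and seed[(3, 0)] == "0"
```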
2.4
Run Time Models
We analyze the complexity of self-assembly arithmetic under two established run time models for tile self-assembly: the parallel time model [4, 5] and the continuous time model [2–4, 8]. Informally, the parallel time model simply adds, in parallel, all singleton tiles that are attachable to a given assembly within a single time step. The continuous time model, in contrast, models the time taken for a single tile to attach as an exponentially distributed random variable. The parallelism of the continuous time model stems from the fact that if an assembly has a large number of attachable positions, then the time until the first tile attaches is an exponentially distributed random variable with rate proportional to the number of attachment sites, implying that a larger number of open positions will speed up the expected time for the next attachment. Technical definitions for each run time model are provided in Section A. For a deterministic tile system Γ, we use notation ρ_Γ and ς_Γ to denote the parallel and continuous run time of Γ respectively. When not otherwise specified, we use the term run time to refer to parallel run time by default.
3
Lower Bound for Long Distance Communication
To prove lower bounds for addition and multiplication in 2D and 3D, we do the following. First, we consider two identical tile systems with the exception of their respective seed assemblies, which differ in exactly one tile location. We show in Lemma B.1 that after ∆ time steps, all positions more than ∆ distance from the point of initial difference of the assemblies must be identical among the two systems. We then consider the communication problem, formally defined in Section B.1, in which we compute the AND function of two input bits under the assumption that the input template for the problem separates the two bits by distance ∆. For such a problem, we know that the output position of the solution bit must be at least distance ½∆ from one of the two input bits. As the correct output for the AND function must be a function of both bits, Lemma B.1 implies that at least ½∆ steps are required to guarantee a correct solution, as argued in Theorem B.2.

With the lower bound of ½∆ established for the communication problem, we move on to the problems of addition and multiplication of n-bit numbers. We show how the communication problem can be reduced to these problems, thereby yielding corresponding lower bounds. In particular, consider the addition problem in 2D. As the input template must contain positions for 2n bits, in 2D it must be the case that some pair of bits are separated by at least Ω(√n) distance according to Lemma B.3. Focusing on this pair of bit positions in the addition template, we can create a corresponding communication problem template with the same two positions as input. To guarantee the correct output, we hard code the remaining bit positions of the addition template such that the addition algorithm is guaranteed to place the AND of the desired bit pair in a specific position in the output template, thereby constituting a solution to the ∆ = Ω(√n) communication problem, which implies the addition solution cannot finish faster than Ω(√n) in the worst case. A similar reduction can be applied to multiplication. The precise reductions are detailed in Theorems B.4 and B.5. The result statements are as follows.

Theorem 3.1. Any d-dimension n-bit adder TAC has worst case run-time Ω(n^{1/d}).

Theorem 3.2. Any d-dimension n-bit multiplier TAC has worst case run-time Ω(n^{1/d}).
Figure 2: Arrows represent carry origination and propagation direction. a) This schematic represents the previously described O(n) worst case addition for addends A and B [5]. The least significant and most significant bits of A and B are denoted by LSB and MSB, respectively. b) The average case O(log n) construction described in this paper is shown here. Addends A and B populate the linear assembly with bit A_i immediately adjacent to B_i. Carry propagation is done in parallel along the length of the assembly.
4
Addition In Average Case Logarithmic Time
We construct an adder TAC that resembles an electronic carry-skip adder in that the carry-out bit for addend pairs where both addends have the same bit value is generated in a constant number of steps and immediately propagated. When the two addends in a pair do not have the same bit value, a carry-out cannot be deduced until the value of the carry-in to the pair is known. When such addend combinations occur in a contiguous sequence, the carry must ripple through the sequence from right to left, one step at a time as each position is evaluated. Within these worst-case sequences, our construction resembles an electronic ripple-carry adder. We show that using this approach it is possible to construct an n-bit adder TAC that can perform addition with an average runtime of O(log n) and a worst-case runtime of O(n).

Lemma 4.1. Consider a non-negative integer N generated uniformly at random from the set {0, 1, . . . , 2^n − 1}. The expected length of the longest substring of contiguous ones in the binary expansion of N is O(log n).

For a detailed discussion of Lemma 4.1 please see Schilling [16].

Theorem 4.2. For any positive integer n, there exists an n-bit adder TAC (tile assembly computer) that has worst case run time O(n) and an average case run time of O(log n).

The proof of Theorem 4.2 follows from the construction of the adder in Appendices C.1, C.2, and C.3.
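A software analogue can make the run-time claim tangible. The sketch below (our illustration, built on the 2k + 8 step count derived in Appendix C.2, not a tile-level simulator) estimates the number of parallel attachment steps from the longest run of "propagate" pairs and empirically checks the O(log n) average case:

```python
import random

def parallel_steps(a_bits, b_bits):
    """Parallel attachment steps for the carry-skip adder per the
    Appendix C.2 analysis: 2k + 8, where k is the longest contiguous
    run of 'propagate' pairs (a_i != b_i)."""
    k = run = 0
    for ai, bi in zip(a_bits, b_bits):
        run = run + 1 if ai != bi else 0
        k = max(k, run)
    return 2 * k + 8

# Empirical check of the O(log n) average case (Lemma 4.1): the longest
# propagate run behaves like the longest run of heads in n coin flips.
n, trials = 1024, 200
avg = sum(parallel_steps([random.randint(0, 1) for _ in range(n)],
                         [random.randint(0, 1) for _ in range(n)])
          for _ in range(trials)) / trials
print(f"n = {n}: average steps ~{avg:.1f}; 2*log2(n) + 8 = {2 * 10 + 8}")
```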
5
Optimal O(√n) Addition
We show how to construct an adder TAC that achieves a run time of O(√n), which matches the lower bound proved in Theorem B.4. This adder TAC closely resembles an electronic carry-select adder in that the addends are divided into sections of size √n, and the sum of the addends comprising each section is computed for both possible carry-in values. The correct result for the subsection is then selected after a carry-out has been propagated from the previous subsection. Within each subsection, the addition scheme resembles a ripple-carry adder. This construction works well with massive parallelism and allows us to construct an optimal O(√n) adder TAC in two dimensions.

Theorem 5.1. There exists a 2D n-bit adder TAC with a worst case run-time of O(√n).

The proof of Theorem 5.1 follows from the construction of the tile assembly adder in Section 5.1 and Appendices D.1, D.2, and D.3.
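The following Python sketch mirrors the carry-select idea in software (a sketch of the arithmetic structure only, with illustrative names; the tile set in Appendix D realizes it geometrically). Both possible outcomes are precomputed for each √n-wide section, after which the true carries select among them:

```python
import math

def carry_select_add(a, b, n):
    """Software analogue of the carry-select scheme for n-bit inputs:
    split the addends into ~sqrt(n)-bit sections, compute each section's
    (carry-out, sum) for both possible carry-ins (done in parallel by
    the assembly), then chain the selections."""
    w = max(1, math.isqrt(n))               # section width ~ sqrt(n)
    mask = (1 << w) - 1
    sections = [((a >> i) & mask, (b >> i) & mask) for i in range(0, n, w)]
    # Phase 1 (parallel in the assembly): both outcomes per section.
    table = [{c: divmod(x + y + c, 1 << w) for c in (0, 1)}
             for x, y in sections]
    # Phase 2: propagate the actual carry, selecting precomputed results.
    carry, out, shift = 0, 0, 0
    for entry in table:
        carry, s = entry[carry]
        out |= s << shift
        shift += w
    return out + (carry << shift)

assert carry_select_add(1001, 1010, 12) == 2011
```

The point of the design, in the assembly as in this sketch, is that the expensive per-section work is independent of the carry chain; only the cheap selection step is sequential across the √n sections.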
5.1
Construction Overview
Due to space limitations, only an overview of the construction is included in this section. A more detailed description of the adder construction (along with the full tile set) can be found in Appendix D.1. Figures 3(a)–3(b) are examples of I/O templates for a 9-bit adder TAC. The inputs to the addition problem in this instance are two 9-bit binary numbers A and B, with the least significant bits of A and B represented by A_0 and B_0, respectively. Upon inclusion of the seed assembly (Figure 4a) to the tile set (Figure 18), each pair of bits from A and B is summed (for example, A_0 + B_0) (Figure 4b). Each row computes this addition step independently and outputs a carry or no-carry west-face glue on the westernmost tile of each row (Figure 4c). As the addition tiles bind to the seed, tiles from the incrementation tile set (Figure 18c) may also begin to attach. The purpose of the incrementation tiles is to determine the sum for each A and B bit pair both in the event of a no-carry from the row below and in the event of a carry from the row below (Figure 5). The final step of the addition mechanism presented here propagates carry or no-carry information northwards from the southernmost row of the assembly. When the carry propagation column reaches the top of the assembly, the most significant bit of the sum may be determined and the calculation is complete (Figure 6a–f).
(a) Addition input template. (b) Addition output template.
Figure 3: Example I/O templates for the worst case O(√n) time addition introduced in Section 5.
Figure 4: Step 1: Addition.
Figure 5: Step 2: Increment.
Figure 6: Step 3: Carry Propagation and Output.
6
Towards Faster Addition
Figure 7: Arrows represent carry origination and direction of propagation for a) the combined construction with O(log n) average case and O(√n) worst case, and b) the O(√n) worst case construction.

In this section we combine the approaches described in Sections 4 and 5 in order to achieve both O(log n) average case addition and O(√n) worst case addition. This construction resembles the construction described in Section 5 in that the numbers to be added are divided into sections and values are computed for both possible carry-in bit values. Additionally, the construction described here lowers the average case run time by utilizing the carry-skip mechanism described in Section 4 within each section and between sections.

Theorem 6.1. There exists a 2-dimensional n-bit adder TAC with an average run-time of O(log n) and a worst case run-time of O(√n).

The proof follows from the construction of the adder in Appendices E.1, E.2, and E.3.
7
Addition in Three Dimensions
In this section we extend our adder construction into the third dimension to achieve an O(n^{1/3}) upper bound, which meets the Ω(n^{1/3}) lower bound from Theorem B.4. Due to space and time constraints a general overview is given. We begin by creating n^{1/3} total n^{1/3} × n^{1/3} constructions exactly as per Section 6, stacked one atop the other in an alternating fashion such that every odd plate, beginning with the first, has its MSB in the NW corner while the even plate has its LSB in that same corner. We continue by applying the same addition algorithm that was presented in Section 6 to all plates, where every lower plate passes its carry-out to the upper plate. This is very similar to how carry-out bits are passed between sections in the construction described in Section 6. Finally, every lower plate will pass the appropriate carry to the next lower plate. Please see Figure 8 for a visual overview of this process.

Theorem 7.1. There exists a 3-dimensional n-bit adder TAC with an average run-time of O(log n) and a worst case run-time of O(n^{1/3}).

We present a high-level sketch of the proof of Theorem 7.1 in Section 7.1.
7.1
Proof Sketch of Theorem 7.1
We present a high-level sketch of the proof of Theorem 7.1 here. Similar to the combined O(log n) average case, O(√n) worst case addition presented in Section 6, the binary numbers to be added are separated into sections, each of length O(n^{1/3}). We arrange these numbers on n^{1/3} 2D scaffolds, each of size n^{1/3} × n^{1/3}. The dashed lines in Figure 8 are carries transmitted between each plate in the third dimension. The lines on the north are carries transmitted from each odd plate to its next even plate. The dashed lines on the southernmost part of Figure 8 are the carries transmitted from the first grouped plates to the last plate. Since the numbers all fit compactly, with only a constant amount of space between each plate and a constant amount of space between each section, any one side of the cube is at most O(n^{1/3}). Therefore, this size constraint along with the algorithms previously presented allows us to achieve an optimal O(n^{1/3}) time complexity with an average case complexity of O(log n).

Figure 8: An overview of the process to add two numbers in optimal time in 3D. Odd and even plates alternate MSB/LSB orientation, with carries passed between consecutive plates.
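For intuition, a hypothetical indexing of bits onto plates might look as follows (our illustration only; the exact geometry, including the snaking within plates, is fixed in the appendices):

```python
def cube_position(bit_index, n):
    """Hypothetical coordinates for the 3D construction: bit i of an
    n-bit number lands on a plate of size m x m, m ~ n^(1/3).  Alternate
    plates are mirrored so that one plate's MSB ends up adjacent to the
    following plate's LSB, letting a carry cross between plates over
    O(1) distance."""
    m = max(1, round(n ** (1 / 3)))
    plate, offset = divmod(bit_index, m * m)
    row, col = divmod(offset, m)
    if plate % 2 == 1:                       # mirror every other plate
        row, col = m - 1 - row, m - 1 - col
    return plate, row, col

# Example: with n = 27 (m = 3), bit 9 starts the second (mirrored) plate.
assert cube_position(9, 27) == (1, 2, 2)
```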
8
Multiplication
In this section we sketch our O(n^{5/6}) time multiplication algorithm. Let a = a_{n−1} a_{n−2} . . . a_0 and b = b_{n−1} b_{n−2} . . . b_0 be two n-bit numbers to be multiplied. The key idea of our algorithm is to compute the product ab by computing the sum ab = Σ_{i=0}^{n−1} a · b_i · 2^i by repeated application of our O(√n) time addition algorithm in a parallelized fashion. The statement of our result is as follows:
Theorem 8.1. There exists a 3D n-bit multiplier TAC with a worst case run-time of O(n^{5/6}).

The seed assembly for our construction consists of a 2D O(√n) × O(√n) pad encoding inputs a and b in a "snaked" pattern as is done with the seed in Appendix E.1. Our algorithm expands this seed into an O(n^{5/6}) × O(n^{5/6}) × O(n^{5/6}) mega-cube consisting of n number cubes, each of dimension O(n^{1/2}) × O(n^{1/2}) × O(n^{1/2}). Each number cube is conceptually associated with a distinct term from the numbers a · b_i · 2^i, and the mega-cube is organized into an O(n^{1/3}) × O(n^{1/3}) × O(n^{1/3}) array of number cubes (yielding the total dimension of O(n^{5/6})). Assembly of the mega-cube proceeds by initializing the growth of O(n^{1/3}) rows of number cubes as shown in Figure 9(b). The growth that initializes each row extends from the seed assembly at an O(1) rate, thereby initializing all rows in time O(n^{5/6}). Each row proceeds by computing the appropriate value a · b_i · 2^i for the given number cube and adding it via our 2D addition algorithm to a running total in O(√n) time per number cube. Note that computing a given value a · b_i · 2^i is simply a bit shift rather than a full-fledged multiplication and can therefore be derived quickly as shown in Appendix F. Therefore, each row finishes in O(n^{1/2} · n^{1/3}) = O(n^{5/6}) time once initialized, yielding an O(n^{5/6}) finish time for the entire first level of the mega-cube shown in Figure 9(b). Further, in addition to initializing the growth of the first plane of the mega-cube, a column is initialized to grow up into the third dimension to seed each of the O(n^{1/3}) levels of the mega-cube, implying that all levels finish in time O(n^{5/6}). Finally, the sums from the rows of a given level are computed into subtotals for each level as depicted in Figure 9(c), followed by the summation of these subtotals into a final solution as shown in Figure 9(d). The details for our multiplication system are provided in Appendix F.1. The full details of our construction are complex and our description in this abstract may be difficult to follow for a general audience. As ongoing work we are developing an expanded write-up and analysis that will be more approachable, and we will present it in the final published version of this paper. Further, our run time bound of O(n^{5/6}) applies to the parallel time model. We therefore get an O*(n^{5/6}) continuous time bound from Lemma A.1. We believe a tighter analysis should yield an equal asymptotic bound for both run time models.

Figure 9: High level sketch of multiplication (stages a–d).
9
Simulation
Tile self-assembly software simulations were conducted to visualize the diverse approaches to fast arithmetic presented in this paper, as well as to compare them to previous work. The adder tile constructions described in Sections 4–6, and the previous best known algorithm [5], were simulated using the two timing models described in Section A. Figure 10 demonstrates the power of the approaches described in this paper compared to previous work.
Figure 10: A comparison of adder constructions.
10
Future Work
The results of this paper are just a jumping off point and provide numerous directions for future work. One promising direction is further exploration of the complexity of multiplication. Can O(n^{1/3}) or O(n^{1/2}) be achieved in 3D? Is sublinear multiplication possible in 2D, or is there an Ω(n) lower bound? What is the average case complexity of multiplication? Another direction for future work is the consideration of metrics other than run time. One potentially important metric is the geometric space taken up by the computation. Our intent is that fast function computing systems such as those presented in this paper will be used as building blocks for larger and more complex self-assembly algorithms. For such applications, the area and volume taken up by the computation is clearly an important constraint. Exploring tradeoffs between run time and imposed space limitations for
computation may be a promising direction with connections to resource bounded computation theory. Along these lines, another important direction is the general development of design methodologies for creating black box self-assembly algorithms that can be plugged into larger systems with little or no "tweaking". A final direction focuses on the consideration of non-deterministic tile assembly systems to improve expected run times even for maniacally designed worst case input strings. Is it possible to achieve O(log n) expected run time for the addition problem regardless of the input bits? If not, are there other problems for which there is a provable gap in achievable assembly time between deterministic and non-deterministic systems?
Acknowledgements We would like to thank Ho-Lin Chen and Damien Woods for helpful discussions regarding Lemma C.2 and Florent Becker for discussions regarding timing models in self-assembly. We would also like to thank Matt Patitz for helpful discussions of tile assembly simulation.
References

[1] Zachary Abel, Nadia Benbernou, Mirela Damian, Erik Demaine, Martin Demaine, Robin Flatland, Scott Kominers, and Robert Schweller, Shape replication through self-assembly and RNase enzymes, SODA 2010: Proceedings of the Twenty-first Annual ACM-SIAM Symposium on Discrete Algorithms (Austin, Texas), Society for Industrial and Applied Mathematics, 2010.

[2] Leonard Adleman, Qi Cheng, Ashish Goel, and Ming-Deh Huang, Running time and program size for self-assembled squares, Proceedings of the Thirty-third Annual ACM Symposium on Theory of Computing (New York, NY, USA), ACM, 2001, pp. 740–748.

[3] Leonard M. Adleman, Qi Cheng, Ashish Goel, Ming-Deh A. Huang, David Kempe, Pablo Moisset de Espanés, and Paul W. K. Rothemund, Combinatorial optimization problems in self-assembly, Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing, 2002, pp. 23–32.

[4] Florent Becker, Ivan Rapaport, and Eric Rémila, Self-assembling classes of shapes with a minimum number of tiles, and in optimal time, Foundations of Software Technology and Theoretical Computer Science (FSTTCS), 2006, pp. 45–56.

[5] Yuriy Brun, Arithmetic computation in the tile assembly model: Addition and multiplication, Theoretical Computer Science 378 (2007), 17–31.

[6] Nathaniel Bryans, Ehsan Chiniforooshan, David Doty, Lila Kari, and Shinnosuke Seki, The power of nondeterminism in self-assembly, SODA 2011: Proceedings of the 22nd Annual ACM-SIAM Symposium on Discrete Algorithms, SIAM, 2011, pp. 590–602.

[7] Harish Chandran, Nikhil Gopalkrishnan, and John H. Reif, The tile complexity of linear assemblies, 36th International Colloquium on Automata, Languages and Programming, vol. 5555, 2009.

[8] Qi Cheng, Ashish Goel, and Pablo Moisset de Espanés, Optimal self-assembly of counters at temperature two, Proceedings of the First Conference on Foundations of Nanoscience: Self-assembled Architectures and Devices, 2004.

[9] Matthew Cook, Yunhui Fu, and Robert T. Schweller, Temperature 1 self-assembly: Deterministic assembly in 3D and probabilistic assembly in 2D, Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2011 (Dana Randall, ed.), SIAM, 2011, pp. 570–589.

[10] Erik Demaine, Matt Patitz, Trent Rogers, Robert Schweller, Scott Summers, and Damien Woods, The two-handed tile assembly model is not intrinsically universal, Proceedings of the 40th International Colloquium on Automata, Languages and Programming (ICALP 2013), 2013.

[11] David Doty, Randomized self-assembly for exact shapes, SIAM Journal on Computing 39 (2010), no. 8, 3521–3552.

[12] David Doty, Jack H. Lutz, Matthew J. Patitz, Robert Schweller, Scott M. Summers, and Damien Woods, The tile assembly model is intrinsically universal, FOCS 2012: Proceedings of the 53rd IEEE Conference on Foundations of Computer Science, 2012.

[13] David Doty, Matthew J. Patitz, Dustin Reishus, Robert T. Schweller, and Scott M. Summers, Strong fault-tolerance for self-assembly with fuzzy temperature, Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS 2010), 2010, pp. 417–426.

[14] Bin Fu, Matthew J. Patitz, Robert Schweller, and Robert Sheline, Self-assembly with geometric tiles, ICALP 2012: Proceedings of the 39th International Colloquium on Automata, Languages and Programming (Warwick, UK), 2012.
[15] Ming-Yang Kao and Robert T. Schweller, Randomized self-assembly for approximate shapes, International Colloquium on Automata, Languages, and Programming, Lecture Notes in Computer Science, vol. 5125, Springer, 2008, pp. 370–384.

[16] Mark Schilling, The longest run of heads, The College Mathematics Journal 21 (1990), no. 3, 196–207.

[17] Robert Schweller and Michael Sherman, Fuel efficient computation in passive self-assembly, Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms, 2013.

[18] Erik Winfree, Algorithmic self-assembly of DNA, Ph.D. thesis, California Institute of Technology, June 1998.

[19] Damien Woods, Ho-Lin Chen, Scott Goodfriend, Nadine Dabby, Erik Winfree, and Peng Yin, Efficient active self-assembly of shapes, Manuscript (2012).
A
Run Time Model Definitions
Parallel Run-time. For a deterministic tile system Γ = (T, S, τ) and assembly A ∈ PROD_Γ, the 1-step transition set of assemblies for A is defined to be STEP_{Γ,A} = {B ∈ PROD_Γ | A →_{T,τ} B}. For a given A ∈ PROD_Γ, let PARALLEL_{Γ,A} = ∪_{B ∈ STEP_{Γ,A}} B, i.e., PARALLEL_{Γ,A} is the result of attaching all singleton tiles that can attach directly to A. Note that since Γ is deterministic, PARALLEL_{Γ,A} is guaranteed to not contain overlapping tiles and is therefore an assembly. For an assembly A, we say A ⇒_Γ A′ if A′ = PARALLEL_{Γ,A}. We define the parallel run-time of a deterministic tile system Γ = (T, S, τ) to be the non-negative integer k such that A_1 ⇒_Γ A_2 ⇒_Γ · · · ⇒_Γ A_k where A_1 = S and {A_k} = TERM_Γ. As notation we denote this value for tile system Γ as ρ_Γ. For any assemblies A and B in PROD_Γ such that A_1 ⇒_Γ A_2 ⇒_Γ · · · ⇒_Γ A_k with A = A_1 and B = A_k, we say that A ⇒_Γ^k B. Alternately, we denote B with notation A ⇒_Γ^k. For a TAC ℑ = (T, U, V, τ) that computes function f, the run time of ℑ on input b is defined to be the parallel run-time of tile system Γ_{ℑ,b} = (T, U_b, τ). Worst case and average case run time are then defined in terms of the largest run time inducing b and the average run time for a uniformly generated random b.

Continuous Run-time. In the continuous run time model the assembly process is modeled as a continuous time Markov chain and the time spent in each assembly is a random variable with an exponential distribution. The states of the Markov chain are the elements of PROD_Γ and the transition rate from A to A′ is 1/|T| if A → A′ and 0 otherwise. For a given tile assembly system Γ, we use notation Γ_MC to denote the corresponding continuous time Markov chain. Given a deterministic tile system Γ with unique terminal assembly U, the run time of Γ is defined to be the expected time for Γ_MC to transition from the seed assembly state to the unique sink state U. As notation we denote this value for tile system Γ as ς_Γ. One immediate useful fact that follows from this modeling is that for a producible assembly A of a given deterministic tile system, if A → A′, then the time taken for Γ_MC to transition from A to the first A″ that is a superset of A′ (through 1 or more tile attachments) is an exponential random variable with rate 1/|T|. For a more general modeling in which different tile types are assigned different rate parameters see [2, 3, 8]. Our use of rate 1/|T| for all tile types corresponds to the assumption that all tile types occur at equal concentration and the expected time for 1 tile (of any type) to hit a given attachable position is normalized to 1 time unit. Note that as our tile sets are constant in size in this paper, the distinction between equal or non-equal tile type concentrations does not affect asymptotic run time. For a TAC ℑ = (T, U, V, τ) that computes function f, the run time of ℑ on input b is defined to be the continuous run-time of tile system Γ_{ℑ,b} = (T, U_b, τ). Worst case and average case run time are then defined in terms of the largest run time inducing b and the average run time for a uniformly generated random b.

Relating Parallel Time and Continuous Time. The following lemma states an upper and lower bound on continuous time with respect to parallel time. Both bounds are straightforward to derive, with the lower bound appearing in [4] and the upper bound being derivable in a fashion similar to the proof of Theorem 5.2 in [2].
The most dramatic distinction between parallel and continuous time occurs when the number of tile types, |T|, is large, as this slows down assembly in the continuous model but does not affect parallel time. When |T| = O(1), which we adhere to in this paper, the lemma implies that the timing models are very close.

Lemma A.1. Consider a deterministic aTAM system Γ = (T, S, τ) that uniquely assembles (finite) assembly A. Let ρ_Γ denote the parallel time for Γ to assemble A and let ς_Γ denote the continuous time for Γ to assemble A. Then,

• ς_Γ ≥ |T| · ρ_Γ
• ς_Γ = O(|T| · ρ_Γ · log(ρ_Γ + |S|))

It is not clear if the log(ρ_Γ) factor in the upper bound for continuous time is necessary, and in fact for the algorithms in this paper the parallel and continuous run times are equal up to constant factors. However, the log(|S|) factor for general systems is important. Consider a 1 × |S| line of tiles for a seed assembly to which |S| copies of a single tile type may attach to form a terminal 2 × |S| assembly. This system has a parallel run time of just 1, but Θ(log |S|) continuous run time, implying that the log |S| factor in our bound is tight in the general case. In [4] a comparison between these two run time models is considered in the extension in which tile types may have different concentrations.
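Both timing models are easy to prototype in software. The sketch below (our illustration; `attachable` and `grow` are hypothetical callbacks, not from the paper) counts parallel steps, and then numerically checks the Θ(log |S|) continuous-time behavior of the 1 × |S| line example above via the maximum of |S| exponential waits:

```python
import random

def parallel_run_time(seed, attachable, grow):
    """Count parallel steps to a terminal assembly: each step applies
    every currently possible attachment at once (a deterministic system
    never produces conflicting attachments).  `attachable(A)` returns
    the set of available attachments; `grow(A, moves)` applies them."""
    steps, assembly = 0, seed
    while True:
        moves = attachable(assembly)
        if not moves:
            return steps
        assembly = grow(assembly, moves)
        steps += 1

# Continuous-time intuition for the 1 x |S| line example: the |S| sites
# fill independently, each after an Exp(1) wait (|T| = 1, rates
# normalized), so the completion time is the max of |S| exponentials,
# whose expectation is the harmonic number H_|S| = Theta(log |S|) --
# while the parallel run time is exactly 1.
S, trials = 1000, 2000
sim = sum(max(random.expovariate(1.0) for _ in range(S))
          for _ in range(trials)) / trials
H = sum(1.0 / k for k in range(1, S + 1))
print(f"simulated completion time {sim:.2f} vs H_|S| = {H:.2f}")
```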
B
Lower Bounds
B.1
Communication Problem.
The ∆-communication problem is the problem of computing the function f(b_1, b_2) = b_1 ∧ b_2 for bits b_1 and b_2 in the 3D aTAM, under the additional constraint that the input template for the solution U = (R, B) be such that ∆ = max(|B(1)_x − B(2)_x|, |B(1)_y − B(2)_y|, |B(1)_z − B(2)_z|). We first establish a lemma which intuitively states that for any 2 seed assemblies that differ in only a single tile position, all points of distance greater than r from the point of difference will be identically tiled (or empty) after r time steps of parallelized tile attachments:

Lemma B.1. Let S_{p,t} and S_{p,t′} denote two assemblies that are identical except for a single tile t versus t′ at position p = (p_x, p_y, p_z) in each assembly. Further, let Γ = (T, S_{p,t}, τ) and Γ′ = (T, S_{p,t′}, τ) be two deterministic tile assembly systems such that S_{p,t} ⇒_Γ^r R and S_{p,t′} ⇒_{Γ′}^r R′ for non-negative integer r. Then for any point q = (q_x, q_y, q_z) such that r < max(|p_x − q_x|, |p_y − q_y|, |p_z − q_z|), it must be that R_q = R′_q, i.e., R and R′ contain the same tile at point q.

The proof of Lemma B.1 can be found in Appendix B.3.

Theorem B.2. Any solution to the ∆-communication problem has run time at least ½∆.

The proof of Theorem B.2 may be found in Appendix B.4.
B.2
Ω(n^{1/d}) Lower Bounds for Addition and Multiplication.
We now show how to reduce instances of the communication problem to the arithmetic problems of addition and multiplication in 2D and 3D to obtain lower bounds of Ω(√n) and Ω(n^{1/3}) respectively. We first show the following lemma, which lower bounds the distance of the farthest pair of points in a set of n points. We believe this lemma is likely well known, or is at least the corollary of an established result. We include its proof for completeness.

Lemma B.3. For positive integers n and d, consider A ⊂ Z^d and B ⊂ Z^d such that A ∩ B = ∅ and |A| = |B| = n. There must exist points p ∈ A and q ∈ B such that ∆_{p,q} ≥ ⌈½⌈(2n)^{1/d}⌉⌉ − 1.

The proof of Lemma B.3 can be found in Appendix B.5.

Theorem B.4. Any n-bit adder TAC that has a dimension d input template for d = 1, d = 2, or d = 3, has a worst case run time of Ω(n^{1/d}).

The proof of Theorem B.4 can be found in Appendix B.6. As the bound of dimension d on the input template of a TAC alone lower bounds the run time of the TAC, we get the following corollary. We now provide a lower bound for multiplication.

Theorem B.5. Any n-bit multiplier TAC that has a dimension d input template for d = 1, d = 2, or d = 3, has a worst case run time of Ω(n^{1/d}).

The proof of Theorem B.5 can be found in Appendix B.7. As with addition, the lower bound implied by the limited dimension of the input template alone yields the general lower bound for d-dimensional multiplication TACs.
B.3
Proof of Lemma B.1
We show this by induction on r. As a base case of r = 0, we have that R = S_{p,t} and R′ = S_{p,t′}, and therefore R and R′ are identical at any point outside of point p by the definition of S_{p,t} and S_{p,t′}.

Inductively, assume that for some integer k we have that for all points w such that k < ∆_{p,w} ≜ max(|p_x − w_x|, |p_y − w_y|, |p_z − w_z|), we have that R^k_w = R′^k_w, where S_{p,t} ⇒_Γ^k R^k and S_{p,t′} ⇒_{Γ′}^k R′^k. Now consider some point q such that k + 1 < ∆_{p,q} ≜ max(|p_x − q_x|, |p_y − q_y|, |p_z − q_z|), along with assemblies R^{k+1} and R′^{k+1} where S_{p,t} ⇒_Γ^{k+1} R^{k+1} and S_{p,t′} ⇒_{Γ′}^{k+1} R′^{k+1}. Consider the direct neighbors (6 of them in 3D) of point q. For each neighbor point c, we know that ∆_{p,c} > k. Therefore, by the inductive hypothesis, R^k_c = R′^k_c where S_{p,t} ⇒_Γ^k R^k and S_{p,t′} ⇒_{Γ′}^k R′^k. Therefore, as attachment of a tile at a position is only dependent on the tiles in neighboring positions of the point, we know that tile R^{k+1}_q may attach to both R^k and R′^k at position q, implying that R^{k+1}_q = R′^{k+1}_q as Γ and Γ′ are deterministic.
B.4
Proof of Theorem B.2
Consider a TAC ℑ = (T, U = (R, B), V, τ) that computes the ∆-communication problem. First, note that B has domain {1, 2}, and V has domain of just {1} (the input is 2 bits, the output is 1 bit). We now consider the value ∆_V, defined to be the largest distance between the output bit position of V and either of the two input bit positions in B: let ∆_V ≜ max(∆_{B(1),V(1)}, ∆_{B(2),V(1)}). Without loss of generality, assume ∆_V = ∆_{B(1),V(1)}. Note that ∆_V ≥ ½∆. Now consider the computation of f(0, 1) = 0 versus the computation of f(1, 1) = 1 via our TAC ℑ. Let A^0 denote the terminal assembly of system Γ_0 = (T, U_{0,1}, τ) and let A^1 denote the terminal assembly of system Γ_1 = (T, U_{1,1}, τ). As ℑ computes f, we know that A^0_{V(1)} ≠ A^1_{V(1)}. Further, from Lemma B.1, we know that for any r < ∆_V, we have that W^0_{V(1)} = W^1_{V(1)} for any W^0 and W^1 such that U_{0,1} ⇒_{Γ_0}^r W^0 and U_{1,1} ⇒_{Γ_1}^r W^1. Let d_ℑ denote the run time of ℑ. Then we know that U_{0,1} ⇒_{Γ_0}^{d_ℑ} A^0 and U_{1,1} ⇒_{Γ_1}^{d_ℑ} A^1 by the definition of run time. If d_ℑ < ∆_V, then Lemma B.1 implies that A^0_{V(1)} = A^1_{V(1)}, which contradicts the fact that ℑ computes f. Therefore, the run time d_ℑ is at least ∆_V ≥ ½∆.
B.5
Proof of Lemma B.3
To see this, consider a bounding box of all 2n points. If all d dimensions of the bounding box were of length strictly less than (2n)^{1/d}, then the box could not contain all 2n points. Therefore, at least one dimension is of length at least ⌈(2n)^{1/d}⌉, implying that there are two points at distance at least ⌈(2n)^{1/d}⌉ − 1 along that particular axis. If these two points are in A and B respectively, then the claim follows. If not, say both are from set A; then there must be a point in B that is at least ⌈½⌈(2n)^{1/d}⌉⌉ − 1 from one of these two points in A, implying the claim.
B.6
Proof of Theorem B.4
To show the lower bound, we will reduce the ∆-communication problem for some ∆ = Ω(n^{1/d}) to the n-bit adder problem with a d-dimension template. Consider some n-bit adder TAC ℑ = (T, U = (F, W), V, τ) such that U is a d-dimension template. The 2n sequence of wildcard positions W of this TAC must be contained in d-dimensional space by the definition of a d-dimension template, and therefore by Lemma B.3 there must exist points W(i) for i ≤ n, and W(n + j) for j ≤ n, such that ∆_{W(i),W(n+j)} ≥ ⌈½⌈(2n)^{1/d}⌉⌉ − 1 = Ω(n^{1/d}). Now consider two n-bit inputs a = a_n . . . a_1 and b = b_n . . . b_1 to the adder TAC ℑ such that: a_k = 0 for any k > i and any k < j, and a_k = 1 for any k such that j ≤ k < i. Further, let b_k = 0 for all k ≠ j. The remaining bits a_i and b_j are unassigned variables of value either 0 or 1. Note that the (i + 1)-st bit of a + b is 1 if and only if a_i and b_j are both value 1. This setup constitutes our reduction of the ∆-communication problem to the addition problem, as the adder TAC template with the specified bits hardcoded in constitutes a template for the ∆-communication problem that produces the AND of the input bit pair. We now specify explicitly how to generate a communication TAC from a given adder TAC.

For a given n-bit adder TAC ℑ = (T, U = (F, W), V, τ) with dimension d input template, we derive a ∆-communication TAC ρ = (T, U² = (F², W²), V², τ) as follows. First, let W²(1) = W(i) and W²(2) = W(n + j). Note that as ∆_{W(i),W(n+j)} = Ω(n^{1/d}), W² satisfies the requirements for a ∆-communication input template for some ∆ = Ω(n^{1/d}). Derive the frame of the template F² from F by adding tiles to F as follows: for any positive integer k > i, or k < j, or k > n but not k = n + j, add a translation of t_0 (with label "0") translated to position W(k). Additionally, for any k such that j ≤ k < i, add a translation of t_1 (with label "1") at translation W(k).

Now consider the ∆-communication TAC ρ = (T, U² = (F², W²), V², τ) for some ∆ = Ω(n^{1/d}). As assembly U²_{a_i,b_j} = U_{a_1...a_n,b_1...b_n}, we know that the worst case run time of ρ is at most that of the worst case run time of ℑ. Therefore, by Theorem B.2, we have that ℑ has a run time of at least Ω(n^{1/d}).
B.7
Proof of Theorem B.5
Consider some n-bit multiplier TAC ℑ = (T, U = (F, W), V, τ) with d-dimension input template. By Lemma B.3, some W(i) and W(n + j) must have distance at least ∆ ≥ ⌈½⌈(2n)^{1/d}⌉⌉ − 1. Now consider input strings a = a_n . . . a_1 and b = b_n . . . b_1 to ℑ such that a_i and b_j are of variable value, and all other a_k and b_k have value 0. For such input strings, the (i + j)-th bit of the product ab has value 1 if and only if a_i = b_j = 1. Thus, we can convert the n-bit multiplier system ℑ into a ∆-communication TAC with the same worst case run time in the same fashion as for Theorem B.4, yielding an Ω(n^{1/d}) lower bound for the worst case run time of ℑ.
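A quick numerical sanity check of the arithmetic identities behind both reductions (ours, for illustration): with the hardcodings above, bit i + 1 of a + b and bit i + j of ab each equal a_i AND b_j:

```python
def and_via_addition(ai, bj, i, j):
    """Bit i+1 of a + b, where a has ones exactly at positions j..i-1
    plus the variable bit a_i at position i, and b = b_j * 2^j."""
    a = sum(1 << k for k in range(j, i)) | (ai << i)
    b = bj << j
    return ((a + b) >> (i + 1)) & 1

def and_via_multiplication(ai, bj, i, j):
    """Bit i+j of ab, where a = a_i * 2^i and b = b_j * 2^j."""
    return (((ai << i) * (bj << j)) >> (i + j)) & 1

for ai in (0, 1):
    for bj in (0, 1):
        assert and_via_addition(ai, bj, 5, 2) == (ai & bj)
        assert and_via_multiplication(ai, bj, 5, 2) == (ai & bj)
```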
C
Average Case O(log n) Time Addition.
C.1
Construction
Figure 11: The complete set of tiles necessary to implement average case O(log n) and worst case O(n) addition.

We summarize the mechanism of addition presented here in a short example. The complete tile set may be found in Figure 11.
Input Template. The input template, or seed, for the construction of an adder with an O(log n) average case is shown in Figure 12. This input template is composed of n blocks, each containing three tiles. Within a block, the easternmost tile is the S-labeled tile, followed by two tiles representing A_k and B_k, the k-th bits of A and B respectively. Of these n blocks, the easternmost and westernmost blocks of the template assembly are unique. Instead of an S tile, the block furthest east has an LSB-labeled tile which accompanies the tiles representing the least significant bits of A and B, A_0 and B_0. The westernmost block of the template assembly contains a block labeled MSB instead of the S block and accompanies the most significant bits of A and B, A_{n−1} and B_{n−1}.
Figure 12: Top: Output template displaying addition result C for O(log n) average case addition construction. Bottom: Input template composed of n blocks of three tiles each, representing n-bit addends A and B.
Computing Carry Out Bits. For clarity, we demonstrate the mechanism of this adder TAC through an example by selecting two 4-bit binary numbers A and B such that the addend pairs A_i and B_i encompass every possible addend combination. The input template for such an addition is shown in Figure 13a, where orange tiles represent bits of A and green tiles represent bits of B. Each block begins the computation in parallel at each S tile. After six parallel steps (Figure 13b–g), all carry out bits, represented by glues C0 and C1, are determined for addend-pairs where A_i and B_i are either both 0 or both 1. For addend pairs A_i and B_i where one addend is 0 and one addend is 1, the carry out bit cannot be deduced until a carry out bit has been produced by the previous addend pair, A_{i−1} and B_{i−1}. By step seven, a carry bit has been presented to all addend pairs that are flanked on the east by an addend pair comprised of either both 0s or both 1s, or that are flanked on the east by the LSB start tile, since the carry in to this site is always 0 (Figure 13h). For those addend pairs flanked on the east by a contiguous sequence of j pairs each consisting of one 1 and one 0, 2j parallel attachment steps must occur before a carry bit is presented to the pair.

Computing the Sum. Once a carry out bit has been computed and carried into an addend pair A_i and B_i, two parallel tile addition steps are required to compute the sum of the addend pair (Figure 13g–j).
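The per-pair behavior that drives this construction is captured by a tiny generate/propagate rule (a software restatement of the mechanism, not a tile):

```python
def carry_signal(ai, bi):
    """Per-pair carry behavior: equal addend bits decide their carry-out
    immediately ('generate' 0 or 1); unequal bits must wait for the
    carry-in ('propagate')."""
    return ai if ai == bi else None   # None: carry-out equals carry-in
```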
C.2
Time Complexity.
C.2.1
Parallel Time
O(n) - worst case. We first show that this construction has an O(n) worst case run-time under the parallel run-time model presented in Section A. Consider a binary sequence T of length 2n representing two n-bit binary numbers A and B. A_0 and B_0 represent the least significant bits of A and B, respectively, and A_{n−1} and B_{n−1} represent the most significant bits of A and B, respectively. The formatting of T is such that T_i = B_{i/2} if i is even, and T_i = A_{(i−1)/2} if i is odd. Sequence T is shown in Figure 15a.
Figure 13: Example tile assembly system to compute the sum of 1001 and 1010 using the adder TAC presented in Section 4. a) The template with the two 4-bit binary numbers to be summed. b) The first step is the parallel binding of tiles to the S, LSB, and MSB tiles. c) Tiles bind cooperatively to west-face S glues and glues representing bits of B. The purpose of this step is to propagate bit B_i closer to A_i so that in d) a tile may bind cooperatively, processing information from both A_i and B_i. e) Note that addend-pairs consisting of either both 1s or both 0s have a tile with a north face glue of either (1,x) or (0,x) bound to the A_i tile. This glue represents a carry out of either 1 or 0 from the addend-pair. In (e–g) the carry outs are propagated westward via tile additions for those addend-pairs where the carry out can be determined. Otherwise, spacer tiles bind. h) Tiles representing bits of C (the sum of A and B) begin to bind where a carry in is known. i–j) As carry bits propagate through sequences of (0,1) or (1,0) addend pairs, the final sum is computed.
In-figure annotations: a glue (a, x) means the carry generated for the next block has value a, while the result in this block will be the value of the carry-in x from the previous block. A glue (x, ¬x) means the carry propagated to the next block is x, the same as the carry-in from the previous block, and the result in this block will be ¬x. If the input values are equal, the carry just generated is sent to the next block immediately; if they are not equal, the carry for the next block is propagated only after the carry-in is accepted.
Figure 14: The set of figures on the left side show how a carry out may be generated before the addend-pair has recieved a carry in. The set of figures on the right side show how a carry out can be dependent upon a carry in having been recieved by the addend-pair. 20
Figure 15: a) Binary sequence T populated by n-bit binary numbers A and B. b) The four possible (Ai, Bi) addend-pair combinations and the carry bits they produce:

    Ai:        0  1  1  0
    Bi:        0  1  0  1
    carry out: 0  1  ?  ?

T contains n addend-pairs, (Ai, Bi), which are ordered pairs consisting of the ith bit of A and the ith bit of B. The four possible values for each addend-pair are shown in Figure 15b, along with the carry bits they produce upon addition. In T, there exist sequences of up to k addend-pairs such that every addend-pair in the interval from (Ai, Bi) to (Ai+j−1, Bi+j−1) of a size-j addend-pair sequence is (1, 0) or (0, 1). In the adder TAC outlined in Section 4 and Appendix C.1, the value of the carry bit produced upon addition of Ai + Bi for every (0, 0) and (1, 1) addend-pair is known and made available to the next addend-pair (Ai+1, Bi+1) after seven parallel tile addition steps, as is the carry bit into (A0, B0). Therefore, after a constant number of parallel tile addition steps, the first addend-pair in any length-j sequence of (1, 0) or (0, 1) addend-pairs will be presented with the carry bit from the previous addend-pair. After 2j subsequent parallel tile addition steps, the carry bit from the last addend-pair (Ai+j−1, Bi+j−1) of the size-j sequence of consecutive (0, 1) and (1, 0) addend-pairs is presented to (Ai+j, Bi+j). Once an addend-pair has received a carry-in bit, the final sum is computed in one tile addition step. If k is the longest contiguous sequence of (0, 1) and (1, 0) addend-pairs, then 2k + 8 parallel tile addition steps are required to compute the sum C of A and B (Figure 16). Therefore, the time complexity is O(k). Since the growth is bounded upwards by the longest contiguous sequence k of (0, 1) and (1, 0) addend-pairs, the worst-case scenario occurs when k = n. Thus, the worst-case run time is O(n).

Figure 16: Carry propagation for an example input, after 7 parallel steps and after 2k + 7 parallel steps.

O(log n) - average case. We now show that the average case run-time is O(log n) under the timing model presented in Section A. In two randomly generated n-bit binary numbers, A and B, the probability of a (1, 0) or (0, 1) addend-pair occurring at (Ai, Bi) is 1/2. The sequence of (Ai, Bi) bit pairs can thus be thought of as a Bernoulli process in which the occurrence of a (0, 1) or (1, 0) addend-pair is as likely as the occurrence of a (1, 1) or (0, 0) addend-pair. As shown above, the runtime is bounded by k, the longest contiguous sequence of (0, 1) or (1, 0) addend-pairs, which can be thought of as the longest run of heads in n independent fair coin tosses. By Lemma 4.1, the expected longest run of heads in n coin tosses is O(log n) [16]. Therefore, the average case time complexity of the tile addition algorithm described above is O(log n).
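The Bernoulli-process view is easy to check numerically. The following sketch (illustrative only; the helper name is ours) samples the longest run of unequal addend-pairs in random inputs and compares it against log2 n:

    import math
    import random

    def longest_unequal_run(n):
        """Longest contiguous run of positions where two random n-bit
        numbers differ, i.e. the longest run of (0,1)/(1,0) addend-pairs."""
        best = run = 0
        for _ in range(n):
            if random.random() < 0.5:   # Pr[Ai != Bi] = 1/2, a fair coin flip
                run += 1
                best = max(best, run)
            else:
                run = 0
        return best

    for n in (2**10, 2**16, 2**20):
        avg = sum(longest_unequal_run(n) for _ in range(50)) / 50
        print(f"n={n:8d}  avg longest run={avg:5.1f}  log2(n)={math.log2(n):4.1f}")

The sampled averages track log2 n closely, matching the O(log n) expectation.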
C.2.2 Continuous Time
To analyze the continuous run time of our TAC we first introduce Lemma C.2. This Lemma is essentially Lemma 5.2 of [19], slightly generalized and modified to be a statement about exponentially distributed random variables rather than a statement specific to the self-assembly model considered in that paper. To derive this Lemma, we make use of a well known Chernoff bound for the sum of exponentially distributed random variables, stated in Lemma C.1.
Lemma C.1. Let $X = \sum_{i=1}^{n} X_i$ be a random variable denoting the sum of $n$ independent exponentially distributed random variables, each with respective rate $\lambda_i$. Let $\lambda = \min\{\lambda_i \mid 1 \leq i \leq n\}$. Then $\Pr[X > (1+\delta)n/\lambda] < \left(\frac{\delta+1}{e^{\delta}}\right)^n$.

Lemma C.2. Let $t_1, \ldots, t_m$ denote $m$ random variables where each $t_i = \sum_{j=1}^{k_i} t_{i,j}$ for some positive integer $k_i$, and the variables $t_{i,j}$ are independent exponentially distributed random variables, each with respective rate $\lambda_{i,j}$. Let $k = \max(k_1, \ldots, k_m)$, $\lambda = \min\{\lambda_{i,j} \mid 1 \leq i \leq m, 1 \leq j \leq k_i\}$, $\lambda' = \max\{\lambda_{i,j} \mid 1 \leq i \leq m, 1 \leq j \leq k_i\}$, and $T = \max(t_1, \ldots, t_m)$. Then $E[T] = O\left(\frac{k + \log m}{\lambda}\right)$ and $E[T] = \Omega\left(\frac{k + \log m}{\lambda'}\right)$.

Proof. First we show that $E[T] = O\left(\frac{k + \log m}{\lambda}\right)$. For each $t_i$ we know that $E[t_i] = \sum_{j=1}^{k_i} \frac{1}{\lambda_{i,j}} \leq k/\lambda$. Applying Lemma C.1 we get that

$$\Pr[t_i > (1+\delta)k/\lambda] < \left(\frac{\delta+1}{e^{\delta}}\right)^k.$$

Applying the union bound we get the following for all $\delta \geq 3$:

$$\Pr\left[T > \frac{(1+\delta)k}{\lambda}\right] \leq m\left(\frac{\delta+1}{e^{\delta}}\right)^k \leq m e^{-k\delta/2}.$$

Let $\delta_m = \frac{2\ln m}{k}$. By plugging in $\delta_m + x$ for $\delta$ in the above inequality we get

$$\Pr\left[T > \frac{(1+\delta_m+x)k}{\lambda}\right] \leq e^{-kx/2}.$$

Therefore we know that

$$E[T] \leq \frac{k + 2\ln m}{\lambda} + \int_{x=0}^{\infty} e^{-kx/2}\,dx = O\left(\frac{k + \log m}{\lambda}\right).$$

To show that $E[T] = \Omega\left(\frac{k + \log m}{\lambda'}\right)$, first note that $T \geq \max_{1\leq i\leq m}\{t_{i,1}\}$ and that $E[\max_{1\leq i\leq m}\{t_{i,1}\}] = \Omega\left(\frac{\log m}{\lambda'}\right)$, implying that $E[T] = \Omega\left(\frac{\log m}{\lambda'}\right)$. Next, observe that $T \geq t_i$ for each $i$. Since $E[t_i]$ is at least $k/\lambda'$ for at least one choice of $i$, we get that $E[T] = \Omega(k/\lambda')$.
The next Lemma helps us upper bound continuous run times by stating that if the assembly process is broken up into separate phases, denoted by a sequence of subassembly waypoints that must be reached, we may simply analyze the expected time to complete each phase and sum the totals for an upper bound on the total run time. The Lemma follows easily from the fact that a given open tile position of an assembly is filled at a rate at least as high as in any smaller subassembly.

Lemma C.3. Let $\Gamma = (T, S, \tau)$ be a deterministic tile assembly system and $A_0, \ldots, A_r$ be elements of $PROD_\Gamma$ such that $A_0 = S$, $A_r$ is the unique terminal assembly of $\Gamma$, and $A_0 \subseteq A_1 \subseteq \cdots \subseteq A_r$. Let $t_i$ be a random variable denoting the time for $\Gamma_{MC}$ to transition from $A_{i-1}$ to the first $A'_i$ such that $A_i \subseteq A'_i$, for $i$ from 1 to $r$. Then $\varsigma_\Gamma \leq \sum_{i=1}^{r} E[t_i]$.

We now bound the run time of our adder TAC by breaking the assembly process up according to waypoint assemblies $A_0 = S$, $A_1$, and $A_2$. We then bound the expected transition times from $A_0$ to $A_1$ and from $A_1$ to $A_2$ to get a bound from Lemma C.3. Let producible assembly $A_1$ be the seed line with the additional 4 or 5 tiles (dependent on the type of bit pair) placed atop each bit pair as shown in Figure 13(g). As each collection of 4 or 5 tiles can assemble independently, the time $t_1$ to transition from the seed to a superset of $A_1$ is bounded by the maximum of n sums of at most 5 exponentially distributed random variables with rate 1/|T|, where T is the tileset for the adder TAC. By Lemma C.2, $E[t_1] = O(\log n)$ for any input pair of n-bit numbers. Now consider the time $t_2$ to transition from $A_1$ to the unique terminal assembly of the system, i.e., the completion of the third row of the assembly. For this phase we are interested in the time taken for the last
finishing chain of tile placements for each maximal contiguous sequence of (0, 1)/(1, 0) addend pairs. If the largest of these sequences has length k, then $t_2$ is bounded by the maximum of at most n sums of at most 3k exponentially distributed random variables with rate 1/|T|. Lemma C.2 therefore implies $E[t_2] = O(k + \log n)$. As k ≤ n we get a worst case time of $E[t_2] = O(n)$. Further, by Lemma 4.1, the average value of k is O(log n), yielding an average case time of $E[t_2] = O(\log n)$. Applying Lemma C.3, we get an upper bound for the total runtime of our TAC of $E[t_1] + E[t_2]$, which is O(n) in the worst case and O(log n) in the average case. Our O(log n) average case is the best that can be hoped for in the continuous time model, as any adder TAC must place at least n + 1 tiles to guarantee correctness. From Lemma C.2, this immediately gives an Ω(log n) lower bound in all cases:

Theorem C.4. The continuous run time for any n-bit adder TAC is Ω(log n) in all cases.
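The scaling promised by Lemma C.2 can be illustrated with a small Monte Carlo sketch (assuming, for simplicity, that all rates are equal; the helper names are ours and not part of the paper's model):

    import math
    import random

    def sampled_T(m, k, lam):
        """One sample of T = max of m sums, each of k exponential(lam) variables."""
        return max(sum(random.expovariate(lam) for _ in range(k))
                   for _ in range(m))

    # E[T] should grow like (k + ln m) / lam when all rates equal lam.
    lam = 1.0
    for m, k in ((100, 5), (10_000, 5), (100, 50)):
        est = sum(sampled_T(m, k, lam) for _ in range(200)) / 200
        print(f"m={m:6d} k={k:3d}  E[T]~{est:6.2f}"
              f"  (k + ln m)/lam = {(k + math.log(m)) / lam:6.2f}")

Increasing m with k fixed grows the estimate logarithmically, while increasing k grows it linearly, as the lemma predicts.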
C.3 Correctness.
Figure 17: The scaffolding needed for the computation.

The analysis for the O(log n) average case, O(n) worst case adder TAC begins from the point in the construction at which all scaffolding is in place (Figure 17). The first step for any addend-pair in the seed is to perform its addition. This addition step outputs (1, x), (0, x), or (x, ¬x) according to the following formulas:

    1 + 1 = (1, x)
    0 + 0 = (0, x)
    1 + 0 = (x, ¬x)
    0 + 1 = (x, ¬x)

The first value in the output represents the carry-out of the addend-pair and the second value represents the value of the addition. In both cases, x refers to the unknown carry-in, which will come from the previous addend-pair's carry out. Given any addend-pair, we can then determine whether or not we have enough information to generate a carry by looking at the first position of the output pair generated in the first step. The two cases that can generate a carry, (1, x) and (0, x), do just that, immediately propagating a 1 or 0 respectively. The other case, (x, ¬x), simply waits for a carry to come in. No addend-pair can calculate its value until it receives a carry from the previous addend-pair. Once an addend-pair receives a carry-in, it can replace any x or ¬x with the proper value and it can "print" the correct value, i.e., the solution at that particular position. Also, if its carry out was the unknown x, it can now propagate its carry to the next addend-pair.
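The arithmetic behind these rules can be checked with a short sketch (a sequential check of our own; the TAC itself resolves the x values in parallel):

    def pair_output(a, b):
        """First-step output for an addend-pair, per the rules above:
        ('1','x') and ('0','x') carry immediately; ('x','!x') must wait."""
        if a == b:
            return (str(a), 'x')    # carry out known: equals the common bit
        return ('x', '!x')          # carry out unknown; sum is NOT carry-in

    def resolve(pairs):
        """Sweep from the LSB with carry-in 0, replacing x/!x as carries arrive."""
        carry, sums = 0, []
        for a, b in pairs:          # pairs ordered least significant first
            out_carry, out_sum = pair_output(a, b)
            if out_carry == 'x':
                sums.append(1 - carry)   # '!x': sum bit is NOT the carry-in
                # the incoming carry passes through unchanged
            else:
                sums.append(carry)       # 'x': sum bit equals the carry-in
                carry = int(out_carry)
        return sums, carry

    # 1001 + 1010 as (Ai, Bi) pairs, LSB first: 9 + 10 = 19 = 10011b
    print(resolve([(1, 0), (0, 1), (0, 0), (1, 1)]))
    # ([1, 1, 0, 0], 1)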
D O(√n) Time Addition.
D.1 Construction.
Input/Output Template. Figure 19(a) and Figure 19(b) are examples of I/O templates for a 9-bit adder TAC.
Figure 18: The complete tile set. a) Tiles involved in bit addition. b) Carry propagation tiles. c) Tiles involved in incrementation. d) Tiles that print the answer.
Figure 19: Example I/O templates for the worst case O(√n) time addition introduced in Section 5. (a) Addition input template. (b) Addition output template.
The inputs to the addition problem in this instance are two 9-bit binary numbers A and B, with the least significant bits of A and B represented by A0 and B0, respectively. The north facing glues of the form Ai or Bi in the input template must either be a 1 or a 0 depending on the value of the bit Ai or Bi. The placement of these tiles is shown in Figure 19(a), while a specific example of a possible input template is shown in Figure 20a. The sum of A + B, C, is a ten-bit binary number where C0 represents the least significant bit. The placement of the tiles representing the result of the addition is shown in Figure 19(b), while a specific example of an output is shown in Figure 22j.

To construct an n-bit adder in the case that n is a perfect square, split the two n-bit numbers into √n sections, each with √n bits. Place the bits for each of these two numbers as per the previous paragraph, except with √n bits per row, making sure to alternate between A and B bits. There will be the same amount of space between each row as seen in the example template 19(a). All Z, NC, and F′ tiles must be placed in the same relative locations. The solution, C, will appear in the output template such that Ci is three tile positions above Bi, and has a total size of n + 1 bits. Below, we use the adder tile set to add two nine-bit numbers, A = 100110101 and B = 110101100, to demonstrate the three stages in which the adder tile system performs addition.

Step One: Addition. With the addition of the seed assembly (Figure 20a) to the tile set (Figure 18), the first subset of tiles able to bind are the addition tiles shown in Figure 18a. These tiles sum each pair of bits from A and B (for example, A0 + B0) (Figure 20b). Tiles shown in yellow are actively involved in adding and output the sum of each bit pair on the north face glue label. Yellow tiles also output a carry to the next more significant bit pair, if one is needed, as a west face glue label. Spacer tiles (white) output a B glue on the north face and serve only to propagate carry information from one set of A and B bits to the next. Each row computes this addition step independently and outputs a carry or no-carry west face glue on the westernmost tile of the row (Figure 20c). In a later step, this carry or no-carry information will be propagated northwards from the southernmost row in order to determine the sum. Note that immediately after the first addition tile is added to a row of the seed assembly, a second layer may begin to form by the attachment of tiles from the increment tile set (Figure 18c). While these two layers may form nearly concurrently, we separate them in this example for clarity and instead address the formation of the second layer of tiles in Step Two: Increment below.
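The layout and Step One can be sketched as follows (hypothetical helpers of ours, not the tile set itself): split the LSB-first bit lists into √n rows and add each row independently with carry-in 0, recording the carry-out that the westernmost tile would expose:

    import math

    def split_rows(bits, row_len):
        """Split an LSB-first bit list into rows of row_len bits each."""
        return [bits[i:i + row_len] for i in range(0, len(bits), row_len)]

    def add_row(a_row, b_row):
        """Add one row's bits (carry-in 0); return (sum_bits, carry_out)."""
        carry, out = 0, []
        for a, b in zip(a_row, b_row):
            s = a + b + carry
            out.append(s & 1)
            carry = s >> 1
        return out, carry

    n = 9
    row_len = math.isqrt(n)           # 3 rows of 3 bits for the 9-bit example
    a = [1, 0, 1, 0, 1, 1, 0, 0, 1]   # LSB first
    b = [0, 0, 1, 1, 0, 1, 0, 1, 1]
    for a_row, b_row in zip(split_rows(a, row_len), split_rows(b, row_len)):
        print(add_row(a_row, b_row))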
Figure 20: Step 1: Addition.
Step Two: Increment. As the addition tiles bind to the seed, tiles from the incrementation tile set (Figure 18c) may also begin to cooperatively attach. For clarity, we show their attachment following the completion of the addition layer. The purpose of the incrementation tiles is to determine the sum for each A and B bit pair both in the event of a no-carry from the row below and in the event of a carry from the row below (Figure 21). The two possibilities for each bit pair are presented as north facing glues on yellow increment tiles. These north face glues are of the form (x, y), where x represents the value of the sum in the event of no carry from the row below and y represents the value of the sum in the event of a carry from the row below. White incrementation tiles are used as spacers, with the sole purpose of passing along carry or no-carry information via their east/west face glues F′, which represents a no-carry, and F, which represents a carry.
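The increment layer's precomputation can be sketched as follows (an illustrative helper of ours; the (x, y) pairs correspond to the north face glues on the yellow tiles):

    def increment_row(sum_bits):
        """From a row's no-carry sum (LSB first), precompute the pairs
        (value with no carry, value if a carry arrives from below), a flag
        recording whether the no-carry sum contains a 0, and the carry that
        would ripple out of the row if a carry arrived."""
        incremented, carry = [], 1
        for bit in sum_bits:          # add 1 and ripple, LSB first
            s = bit + carry
            incremented.append(s & 1)
            carry = s >> 1
        has_zero = 0 in sum_bits
        return list(zip(sum_bits, incremented)), has_zero, carry

    print(increment_row([1, 1, 0]))
    # ([(1, 0), (1, 0), (0, 1)], True, 0)

Note that the increment only ripples a carry out of the row when the no-carry sum contains no zero, which is exactly the information the Z flag below records.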
Figure 21: Step 2: Increment.
Step Three: Carry Propagation and Output. The final step of the addition mechanism presented here propagates carry or no-carry information northwards from the southernmost row of the assembly using tiles from the tile set in Figure 18b and then outputs the answer using the tile set in Figure 18d. Following completion of the incrementation layers, tiles may begin to grow up the west side of the assembly as shown in Figure 22a. When the tiles grow to a height such that the empty space above an increment row is presented with a carry or no-carry as in Figure 22b, the output tiles may begin to attach from west to east to print the answer (Figure 22c). As the carry propagation column grows northwards and presents carry or no-carry information to each empty space above each increment layer, the sum may be printed for each row (Figures 22d-e). When the carry propagation column reaches the top of the assembly, the most significant bit of the sum may be determined and the calculation is complete (Figure 22f).
D.2 Time Complexity.
D.2.1 Parallel Time
Using the runtime model presented in Section A (Run-time), we will show that the addition algorithm presented in this section has a worst case runtime of O(√n). In order to ease the analysis we will assume that each logical step of the algorithm happens in a synchronized fashion even though parts of the algorithm are running concurrently.

The first step of the algorithm is the addition of two O(√n)-bit numbers co-located on the same row. This first step occurs by way of a linear growth starting from the leftmost bit all the way to the rightmost bit of the row. The growth of a line one tile at a time has a runtime on the order of the length of the line. In the case of our algorithm, the row is of size O(√n) and so the runtime for each row is O(√n).
Figure 22: Step 3: Carry Propagation and Output.

The addition of each of the √n rows happens independently, in parallel, leading to an O(√n) runtime for all rows. Next, we increment each solution in each row of the addition step, keeping both the new and old values. As with the first step, each row can be completed independently in parallel by way of a linear growth across the row, leading to a total runtime of O(√n) for this step. After we increment our current working solutions we must both generate and propagate the carries for each row. In our algorithm, this simply involves growing a line across the leftmost wall of the rows. The size of the wall is bounded by O(√n) and so this step takes O(√n) time. Finally, in order to output the result bits into their proper places we simply grow a line atop the line created by the increment step. This step has the same runtime properties as the addition and increment steps. Therefore, the output step has a runtime of O(√n) to output all rows. There are four steps, each taking O(√n) time, leading to a total runtime of O(√n) for this algorithm. This upper bound meets the lower bound presented in Corollary 3.1 and the algorithm is therefore optimal.

Choice of √n rows of √n size. The choice of dividing the bits up into a √n × √n grid is straightforward. Imagine that instead of using O(√n) bits per row, a much smaller growing function such as O(log n) bits per row is used. Then, each row would finish in O(log n) time. After each row finishes, we would have to
propagate the carries. The length of the west wall would no longer be bounded by the slowly growing function O(√n) but would now be bounded by the much faster growing function O(n/log n). Therefore, there is a distinct trade-off between the time necessary to add each row and the time necessary to propagate the carry with this scheme. The runtime of this algorithm can be viewed as max(rowsize, westwallsize). The best way to minimize this function is to divide the bits so that we have the same number of rows as columns. The best way to partition the bits is therefore into √n rows of √n bits.

D.2.2 Continuous Time
To bound the continuous run time of our TAC for any pair of n-bit input strings we break the assembly process up into 4 phases. Let the phases be defined by 5 producible assemblies S = A0, A1, A2, A3, A4 = F, where F is the final terminal assembly of our TAC. Let ti denote a random variable representing the time taken to grow from Ai−1 to the first assembly A′ such that A′ is a superset of Ai. Note that E[t1] + E[t2] + E[t3] + E[t4] is an upper bound for the continuous time of our TAC according to Lemma C.3. We now specify the four phase assemblies and bound the time to transition between each phase to achieve a bound on the continuous run time of our TAC for any input.

Let A1 be the seed assembly plus the attachment of one layer of tiles above each of the input bits (see Figure 20(c) for an example assembly). Let A2 be assembly A1 with the second layer of tiles above the input bits placed (see Figure 21(b)). Let A3 be assembly A2 with the added vertical chain of tiles at the westernmost edge of the completed assembly, running all the way to the top and ending with the final blue output tile. Finally, let A4 be the final terminal assembly of the system, which is assembly A3 with the added third layer of tiles attached above the input bits.

Time t1 is the maximum of the times taken to complete the attachment of the first row of tiles above the input bits for each of the √n rows. As each row can begin independently, and each row grows from right to left, with each tile position waiting for the previous tile position, each row completes in time dictated by a random variable which is the sum of √n independent exponentially distributed random variables with rate 1/|T|, where T is the tileset of our adder TAC. Therefore, according to Lemma C.2, we have that t1 = Θ(√n + log n) = Θ(√n). The same analysis applies to get t2 = Θ(√n). The value t3 is simply the sum of 4√n + 2 exponentially distributed random variables and thus has expected value Θ(√n). Finally, t4 = Θ(√n) by the same analysis given for t1 and t2. Therefore, by Lemma C.3, our TAC has worst case O(√n) continuous run time.
D.3 Correctness.
The first two steps of the algorithm are addition followed by incrementation. It is important to note that this incrementation step not only outputs both the original (addition result) and incremented values but also whether or not the original value contained a zero. This is important because it will allow us to later use this information to decide whether or not a row will contain a carry. These two steps of the algorithm rely on nothing but the data in the current row and are completed in parallel across all rows. Therefore, with the given tile set in Figure 18, these two steps are straightforward to verify.

The next step is for every row to select a solution from the two that were generated, as well as to propagate its carry information. We will, for the moment, assume that some row has not received the information of the previous carry and will concentrate on this row. At this step we know whether the current row generated a carry, C ∈ {T, F}, whether the row's sum contains a zero, Z ∈ {T, F}, and two possible row values A and B. A represents the sum if the row receives a carry-in of 0, and B the sum if it receives a carry-in of 1. In order to continue from this step, an answer must be selected and a carry or no-carry must be propagated. If Z = T, then we immediately know that we will propagate whatever C may be. If Z = F and C = T, we also know that we must propagate a carry. The only situation in which we do not know whether we will propagate a carry is when Z = F and C = F. When we encounter this situation we propagate whatever the carry was in the previous row. Also, in order to decide whether we will select A or B we need only the previous carry. Therefore, assuming we have the correct previous carry, we can correctly select both the proper value for this row as well as propagate the correct carry to the next row.

Finally, because we know the initial carry is correct (it is part of the seed), we know that the first row can select the correct result as well as propagate the correct carry. Then, because we know that the first row's carry is correct, we know that the second row can select the correct result as well as propagate the correct carry. This chain continues until it reaches the last row, leading to all the correct values being selected and all the correct carries propagated. The last step is to select the most significant bit's value, which is based solely on the last row's carry: if the last row propagates a carry it is a 1, otherwise it is a 0. Since we know that the last carry is correct, we know that the last value is selected properly.
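This selection rule can be written out directly (a sketch with our own names; Z, C, and the two candidate values per row are exactly as in the argument above):

    def propagate_rows(rows):
        """rows: list of (Z, C, no_carry_bits, carry_bits) from south to
        north, where Z = 'no-carry sum contains a zero' and C = 'row
        generated a carry'. Returns selected bits per row and final carry."""
        carry_in, selected = 0, []
        for z, c, a_bits, b_bits in rows:
            selected.append(b_bits if carry_in else a_bits)
            if z:                       # a zero absorbs any incoming carry,
                carry_out = int(c)      # so propagate only the row's own C
            elif c:                     # no zero, carry generated: carry out
                carry_out = 1
            else:                       # no zero, no carry generated:
                carry_out = carry_in    # the incoming carry ripples through
            carry_in = carry_out
        return selected, carry_in       # final carry is the sum's MSB

    # Two made-up rows for illustration (values are arbitrary):
    rows = [(True, False, [1, 0, 1], [0, 1, 1]),
            (False, False, [1, 1, 1], [0, 0, 0])]
    print(propagate_rows(rows))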
E O(log n) Average Case, O(√n) Worst Case Addition.
E.1 Construction
This construction combines the two-dimensional scaffold principle of the simple worst-case addition construction (Section 5) with the principle that certain addend-pairs can compute a carry out before they receive a carry in, which was shown to reduce the average case run-time to O(log n) in Section C.2.

Contrary to the addition construction in Section D.1, the directionality of adjacent rows is antiparallel in the construction described here. Every odd row, beginning with row one (the southernmost row), has the least significant bit on the east and the most significant bit on the west. Every even row has bits in the opposite order, as shown in Figure 25a. Each odd row, along with the even row above it, should be considered as a single section in which the addition mechanism is nearly identical to the O(log n) average case adder TAC. Within each section, carry outs are propagated east to west on the odd row, up in constant time from the most significant bit of the odd row (OMSB) to the least significant bit of the even row (ELSB), and west to east on the even row. Adjacent rows are antiparallel so that the most significant bit (MSB) of each section is a constant distance from the least significant bit (LSB) of the section above. This modification allows us to apply the O(log n) average case addition within each pair of rows, and at the same time apply the carry propagation mechanism of the O(√n) worst case addition construction between the MSBs of the even rows.

Carry Passing and Prediction of Results Within Sections. Within each section, or odd/even row pair, carry outs are propagated as stated in the paragraph above (Figure 26a). In addition, each row will present "predicted" results, that is, the result of the computation as if there were no carry from the MSB of the previous section. These results are shown as yellow tiles in Figure 26b-c. Note the blue tiles in Figure 26d. These blue tiles represent the final sum of the addend pairs. They may be computed without carry information propagated from a section below because the addend pairs either 1) are located in the southernmost section, where it is immediately known that the carry-in is 0, or 2) are flanked by a less significant addend pair for which the carry out is immediately known. The north face glues on yellow tiles in Figure 26d which contain a star (*) rely on a carry from a previous section. The mechanism of carry propagation within a section can be seen in Figure 27a-b. Vertical columns grow up along the west side of each section to propagate the carry from the odd row to the even row.

Carry Propagation Between Sections. Figure 27c shows a vertical column growing between the southernmost section and the section above. This column propagates a carry from the most significant bit of the southernmost section A to the least significant bit of the section B to the north. This information continues to the most significant bit of section B, at which point a carry bit is computed to propagate to the section north of B (Figure 27d, Figure 28a). The terminal tile assembly with the computed sum is presented in Figure 28b.
Figure 23: Partial set of tiles necessary to implement the O(log n) average case, O(√n) worst case combined addition (see also next figure).
Figure 24: Tiles necessary to implement the O(log n) average case, O(√n) worst case combined addition (continued from previous figure).
Figure 25: a) The input template for the addition of two n-bit binary numbers A and B. b) The output template for the addition of two n-bit binary numbers A and B.
E.2 Time Complexity.
O(√n) Worst Case and O(log n) Average Case Run-Time. The construction presented in this section combines the constructions presented in Sections 4 and 5. Thus, this runtime analysis combines elements from the time complexity proofs in C.2 and D.2. In the construction described here, each section on the scaffold precomputes the sum both as if there were a carry from the previous section and as if there were no carry from the previous section. In some cases, whether a carry bit is passed in from the section below is irrelevant, and parts of the final sum may be generated before this information is available. If the most significant addend-pair bits (MSB) of a given section can immediately generate a carry out for propagation to the next section, they do so. When a carry in arrives from the previous section, final values are selected from the precomputation step, if necessary. In this analysis of time complexity, we first analyze the run time of the precomputation step, and then consider the run time of propagating carry bits from section to section up the scaffold.

Consider the binary sequence T of length 2n composed of n-bit binary numbers A and B as defined in Section C.2. For this construction, divide T into √n/2 smaller sequences, or sections, S0, S1, ..., S√n/2−1, with each section containing 2√n addend-pairs. These sections are arranged on a scaffold as depicted in Figure 29. Note that the distance between the bottom half and the top half of a given section is constant and requires a constant number of steps to traverse.
Figure 26: a) Carry out bits are computed for each addend pair where a carry out can be deduced immediately. b-c) "Predicted" results (yellow tiles) begin to attach. d) Blue tiles represent the final result.
Figure 27: a-b) Carry bits are propagated within sections from the MSB of the odd row to the LSB of the even row using vertical columns. c-d) Carry bits are propagated between sections via vertical columns on the east side of the assembly.
Figure 28: a) A carry bit from the first section propagates through the middle section in this assembly. b) The terminal assembly displaying the output C.

We first treat each section, Si, as an independent addition problem without regard for a carry in from the previous section. Define k as the longest contiguous sequence of (0, 1) and (1, 0) addend-pairs in Si. It follows from the proof in Section C.2 that the run-time for Si is bounded upwards by the length of k. The worst-case runtime for addition over the sequence Si would thus occur when k = 2√n. Therefore, the worst case run time for addition within a section is O(√n). It also follows, as described in Section C.2, that the expected length of k is O(log n). Therefore, the average run time for each addition within each section is O(log n). The precomputation within each of the √n/2 sections occurs independently in parallel, leading to a worst case O(√n), average case O(log n) precomputation run-time for all sections.

After performing this precomputation within each section, we must propagate carries between the sections. The distance between two neighboring sections is constant and may be traversed by a column of tiles in a constant number of steps. Therefore, a carry out bit, once generated, can be propagated from one section to the next in constant time. A carry out bit may be propagated to the next section, Si+1, immediately if the most significant addend pair of section Si consists of (0, 0) or (1, 1). Otherwise, this most significant addend pair of Si must wait for a carry in before generating a carry out. Therefore, the composition of the most significant addend-pairs of the sections acts as a limiting factor on the speed with which carries may be propagated across all of the sections. Consider the binary sequence P which is comprised of the most significant addend pairs of each section and has a length of √n/2 addend-pairs. Let k be the longest contiguous sequence of (1, 0) and (0, 1) addend-pairs in P. The propagation of carry bits through P is bounded upwards by the length of k, with the worst case being k = √n/2 and the average case being k = O(log n). Thus, the propagation of carry bits between the sections, up to the most significant addend-pair of T, is bounded upwards by O(√n) and has an average run-time of O(log n). Therefore, this addition algorithm
has an upper bound of O(√n) and an average run-time of O(log n). The parallel run time analysis presented above can be extended to the continuous run time model through application of Lemma C.2 and Lemma C.3, similar to what is done in Sections C.2.2 and D.2.2, yielding worst case O(√n) and average case O(log n) continuous run time bounds.

Figure 29: Top: T is separated into sections. Bottom: Each section of T is arranged on a scaffold. Arrows represent the general direction of carry bit propagation. MSB denotes the most significant bit of a section, and LSB denotes the least significant bit of a section.
E.3 Correctness.
Every row with north facing glues performs its addition in exactly the same way as described in Section C.1, assuming an incoming carry of 0. As such, we use the proof of correctness from Section C.3 to show that this step is correct. Also, every row with south facing glues performs this same addition, except with an incoming carry from the addition on the lower, north facing glue side. Thus, the same proof also applies to this side. Considering both of these additions as the first step, we can say that the first step correctly adds a section with a carry-in of 0.

Since the first section has no sections before it, the addition of the first section has an input carry of 0, allowing the copying of its value up to the final "display" row. Once an addend-pair knows its carry-in it can display its results. This, combined with the proof that the addition of any section with an input carry of 0 is correct, proves the correctness of the first section. Once a section finishes the first step, there are three possible carry-outs from the MSB addend-pair of the section: 0, 1, and 0*. If the section's MSB addend-pair propagates a 0 or 1, then somewhere in the section some addend-pair generated its carry without depending on a carry from the previous section, leading to a correct propagation. If the section's MSB addend-pair propagates a 0*, then no addend-pair in the section was able to generate a carry, meaning that no addend-pair in the section contained equal bits. Therefore, the section will correctly propagate whatever carry it received to the next section. These two clauses cover all cases of carry propagation from one section to the next.

Every section other than the first performs its addition with an input carry of 0*, which indicates an unknown carry with an assumed value of 0. In other words, it continues the calculation assuming a 0 input carry but cannot copy any undetermined values, due to the unknown carry, into the "display" row until it receives the real carry from the previous section. This carry only gets propagated until it reaches an addend-pair with equal bits, because at that point the addend-pair would already have generated and propagated its carry regardless of any incoming input carry. One can see that the beginning of any section, other than
the first, is essentially calculated by the algorithm as if it was in the center of some series of addend-pairs with unequal bits. When the correct carry propagates from the previous section to the current section, the correct values may then be copied into the “display” row. If a 0 is carried in, then it is a simple copy up of the value into the “display” row. If a 1 is carried in, then it is the inverse of what was previously calculated. Assuming some section receives the correct carry from the previous section, that section will “display” the correct result. We have shown that the first section propagates a carry correctly, and therefore all subsequent sections propagate their correct carry out. We have also shown that if each section propagates the correct carry-out to the next section, then the addition of addend-pairs within each section is performed correctly, producing the correct sum of A and B, C.
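The three-valued carry passing between sections can be sketched as follows (an illustrative helper of ours; '0*' plays the role of the unknown-but-assumed-0 carry):

    def combine_carry(section_msb_carry, incoming):
        """Between-section carry combination: a section reports '0' or '1'
        if some pair inside fixed the carry, or '0*' (unknown, assumed 0)
        if every pair in the section had unequal bits."""
        if section_msb_carry in ('0', '1'):
            return section_msb_carry    # carry was fixed inside the section
        return incoming                 # '0*': pass the incoming carry along

    carry = '0'                         # the first section's carry-in is 0
    for section_out in ['1', '0*', '0*', '0']:
        carry = combine_carry(section_out, carry)
        print(carry)                    # 1, 1, 1, 0

Note how a run of '0*' sections simply threads the last fixed carry through, which is exactly why the runtime is governed by the longest such run.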
F Multiplication
F.1 Construction.
This TAC uses a shift-and-add algorithm for multiplication. Let A and B be two n-bit binary numbers, and let $a_i$ and $b_i$ represent the $i$th bits of A and B respectively. Note that if $B = \sum_i b_i 2^i$ for $b_i \in \{0, 1\}$, then the product $AB = \sum_i A b_i 2^i$. In this algorithm we perform the multiplication by multiplying A by each of the powers of two in B. Each of these partial products is added to the running sum as soon as the partial product is available. To describe our construction we focus on the case of multiplying two 64-bit numbers.

Vector Label And Tile Types. To permit a high-level description of the multiplication TAC, we use a Vector Label technique to designate the binding possibilities between tiles. Rather than showing the glue types and their placement on each edge of a tile, we label a tile with a vector which describes the rules by which that tile may bind to adjacent tiles. A tile with such a vector actually describes a group of tiles, one of which contains the vector element that allows attachment to a given assembly. The number of tile types described by a Vector Labeled tileset is at most O(l^d), where l is the number of label types and d is the vector dimension. An example set of Vector Labeled tiles may be found in Figure 30.
Figure 30: a) A pair of Vector Labeled tiles describes a binding rule. Dimensions a and b must be the same on both tiles for the tiles to bind, and tiles containing r1 in the third dimension may bind only to tiles containing r0 in the third dimension. b) Such a binding rule can describe 8 tile types with 4 binding possibilities.
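The binding rule of Figure 30 can be checked directly (an illustrative sketch with our own names):

    def may_bind(t1, t2):
        """Two vector-labeled tiles (a, b, r) bind iff their a and b entries
        match and one carries r0 while the other carries r1."""
        return (t1[0] == t2[0] and t1[1] == t2[1]
                and {t1[2], t2[2]} == {'r0', 'r1'})

    tiles = [(a, b, r) for a in (0, 1) for b in (0, 1) for r in ('r0', 'r1')]
    print(len(tiles))                                       # 8 tile types
    print(sum(may_bind(x, y) for x in tiles for y in tiles) // 2)  # 4 pairs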
Input/Output Template. The input template encoding the multiplicand and multiplier, A and B, is a rectangular assembly of 2n tiles. The product of two n-bit numbers will have at most 2n bits. The extra n bits of space comprise the west half of the assembly, as shown in Figure 31(a), while the east half encodes A and B via vector labeled tiles. The first and second elements of the vector on each tile are a given bit of A and B. The final assembly for 64-bit inputs is shown in Figure 31(b). The product AB is encoded on the top surface of the final assembly, marked as Output in Figure 31(b). Before going into the details of the TAC implementation, we first give a brief overview of the different parts of the output structure (Fig. 31(b)). Given two n-bit inputs, A and B, our multiplication algorithm involves a
summation of n numbers, x0 + x1 + ... + xn−1. For any number xk within these n numbers, xk = A·bk·2^k. The output assembly (Fig. 31(b)) has four horizontal layers, one stacked on top of the next, each of which sums 16 of the 64 numbers to be summed in the 64-bit example. More generally, there are n^{1/3} layers, l0, l1, ..., l_{n^{1/3}−1}, and each layer lk is responsible for summing n^{2/3} of the n numbers, x_{kn^{2/3}} + x_{kn^{2/3}+1} + ... + x_{(k+1)n^{2/3}−1}. On each layer, there are n^{1/3} columns running west to east, each of which sums n^{1/3} of the n^{2/3} numbers summed within that layer.
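The grouping into layers and columns can be checked with a short sketch (our own helper, assuming n is a perfect cube as in the 64-bit example):

    def shift_and_add(A, B, n):
        """Sum x_k = A * b_k * 2^k grouped into n^(1/3) layers of n^(2/3)
        numbers, each layer split into n^(1/3) columns of n^(1/3) numbers."""
        c = round(n ** (1 / 3))
        total = 0
        for layer in range(c):                  # bottom layer to top
            layer_sum = 0
            for col in range(c):                # n^(1/3) columns per layer
                col_sum = 0
                for j in range(c):              # n^(1/3) numbers per column
                    k = layer * c * c + col * c + j
                    col_sum += A * ((B >> k) & 1) * (1 << k)
                layer_sum += col_sum
            total += layer_sum
        return total

    assert shift_and_add(12345, 54321, 64) == 12345 * 54321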
Figure 31: The input seed and the output assembly for 64-bit inputs, A and B. (a) Multiplication input template assembly. (b) Multiplication output assembly.
Part One: Deployment. An important step in the multiplication process is deploying the n numbers to be summed, x0 + x1 + ... + xn−1. These numbers must be moved into different positions for addition to take place. First, the seed (Fig. 32(a)) is copied both up and south, as depicted in Figure 32(b). Figure 37 demonstrates that for any 1-dimensional input, it is possible to create a tile set to send the input information in two directions. This same principle can be used to copy the 2-dimensional input area both up and south into three dimensions, as shown in Figure 32(b). The copy of the seed to the south will then be copied to the south again and to the east, as shown in Figure 32(c). This copy to the east will head the formation of a column. Columns run west to east in each layer, and each layer contains n^{1/3} columns. In this particular 64-bit example, there will be four columns in each layer, each of length n^{1/3} blocks. The input information will continue to be propagated south and east until all columns for a given layer are formed. The input information that was copied up (Fig. 32(c)) will act as a seed for the next layer, which is formed in the same way as the first layer.

As the input information is copied in different directions, different functions are applied. For the input information copied up, n^{2/3} bits of zero are attached after the least significant bit of A, and the least significant n^{2/3} bits of B are removed. Let p(x) = 2x and q(x) = ⌊x/2⌋. Then in Figure 32(c), A = p^{16}(A) means attach 16 zeros after A, and B = q^{16}(B) means remove the 16 least significant bits from B. We add and remove these bits by shifting numbers; Figure 38(a) demonstrates that it is possible to create a tile set to shift any input. The copy of the input information to the east marks the head of the first column. As this column is propagated from west to east, the least significant bits of B are removed one by one, and bits of 0 are appended to the least significant end of A one by one, as seen in Figure 33(a). At the end of the column, a total of n^{1/3} 0 bits will have been appended to A and the n^{1/3} least significant bits of B will have been removed. The copy of the input to the south propagates further to the south and then east to form the next column (Figure 33(b)-33(c)). During this southward propagation, the least significant n^{1/3} bits of B are removed,
and n^{1/3} bits of zero are appended to the end of A. This southward propagation continues, each time appending n^{1/3} additional bits of zero to the least significant end of A and removing the n^{1/3} least significant bits from B, until the heads of each of the n^{1/3} columns have formed on the east side. These columns are then free to grow eastward (Figure 34(a)). The copy of the seed that is sent upward has the least significant n^{2/3} bits removed from B and n^{2/3} bits of 0 appended to the least significant end of A. This is the beginning of the next layer. As the information is propagated upward, it may continue to form layers, which form in parallel (Figure 33(c)). Each time the information is propagated upwards to form the next layer, another n^{2/3} bits of 0 are appended to the least significant end of A, and the least significant n^{2/3} bits are removed from B.
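The deployment shifts can be sketched with the functions p and q from above (an illustrative helper of ours; the layer and column indices select how many times p and q are applied):

    def p(x): return x << 1      # append a 0 after the least significant bit
    def q(x): return x >> 1      # remove the least significant bit

    def operands_for(layer, column, A, B, n):
        """Each upward copy applies p and q n^(2/3) times, each column step
        n^(1/3) times (n a perfect cube, e.g. n = 64)."""
        c = round(n ** (1 / 3))
        shift = layer * c * c + column * c
        a, b = A, B
        for _ in range(shift):
            a, b = p(a), q(b)
        return a, b              # a = A * 2^shift, b = B // 2^shift

    print(operands_for(1, 2, 0b100110, 0b101110, 64))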
Figure 32: This series shows the propagation of information south for use in the formation of columns and up to form a new layer. (a) Seed assembly for multiplication. (b) The seed information is propagated up and to the south. (c) Seed information is again propagated upward to form the next layer, this time with 16 bits of 0 appended after the least significant bit of A, and 16 of the least significant bits of B removed.
Figure 33: This series shows the process of forming new columns and layers. (a) This partial assembly shows the input information being modified and propagated upward for the next layer, eastward for the formation of a new column, and southward to generate more columns. (b) This partial assembly shows the beginning of the formation of the next layer. (c) This partial assembly shows the continued formation of new layers, and the columns within those layers.
Part Two: Addition. While shifting bits between each layer, and between and within each column, the tile set simultaneously performs the additions. The addition algorithm here is based on Section 5. Recall that our multiplication algorithm involves a summation of n numbers, x0 + x1 + ... + xn−1, where xk = A·bk·2^k. An example of a two-number multiplication, which is required to attain xk, is shown in Figure 36. The five elements in the vector label are the current bit of A, the current bit of B, the predicted result without a carry in, the predicted result with a carry in, and the current bit result. The numbers are arranged in zigzag fashion. After obtaining the result of each column, shown in Figure 34(a), we copy the result of the first column, rotate this result to the south, and send it to its neighbor (Figure 34(b)). Note that we do not need A and B any more; we only copy the result. Using the reversed design of the rotate and copy box (Figure 37), we merge the two inputs from two sides into one single output. Using the same principle shown in Section 5 and Figure 36, we then sum all the results from the columns together, as shown in Figure 34(c) and Figure 34(d).
Figure 34: This series shows the process of summing all numbers within each layer. (a) Sums within each column have been calculated. (b) The sum of the numbers within each column is propagated south. (c) The sums of the numbers within each column are totaled. (d) The sum of all the numbers within the layer is presented on the south side of the layer.

The result of the summation in the first layer will be copied up to the second layer (Figure 35(c)), where it is added to the result from the second layer. This process is repeated until the last layer is reached and a final result is obtained, as shown in Figure 35(b) and Figure 35(c).
Figure 35: After summing the results of each layer, the multiplication result is displayed on the top surface of the assembly. (a) Completed formation of the first layer of the assembly. Note that the other layers may form in parallel. (b) All layers within the assembly for the 64-bit inputs are completed. (c) The sum of the numbers within each layer is propagated northward and added to a running total of the sums of each layer.
F.2 Time Complexity.

O(n^{5/6}) Worst Case Run-Time. Our multiplication construction consists of copy (Figure 37), shift (Figure 38(a)), and addition (Figure 36) operations. The addition algorithm here is a 3D version of our O(√n) Worst
Case Addition TAC (Section D.2). In order to ease the analysis we will assume that each logical step of the algorithm happens in a step-by-step fashion even though parts of the algorithm are running concurrently.

Runtime For Copy Block. The system copies the input number up and south using a copy block. The input for a copy block is an O(√n) by O(√n) rectangle. For each row running north to south a copy is generated independently, so we analyze the process in a longitudinal section and then expand it to the entire block. Figure 37 shows an example of a copy box copying one row up and to the left. The copy box starts the copy from the leftmost input. This tile is marked as "r". If a tile is marked as "r", the tile on its right will mark its north tile as "r"; otherwise, the tile will just copy itself to the north. After a tile receives an "r", the tile to its north will then send the label both up and to the left. After that, the tiles sending labels to the left will also copy the label from the bottom to the top, thus sending the labels up. Each tile under the "r" tiles is dependent on the tiles to its left and bottom, so the runtime for all tiles under the attached diagonal is O(√n). The tiles after the "r" tiles are dependent on the tiles to their right and bottom, so the total runtime of this rotate and copy box is O(√n).

Runtime For Shift Block. The system shifts the input numbers in order to attach or remove bits using a shift block. An example of shifting one row of input to the left is shown in Figure 38(a). In this example, bits are shifted to the left. So, from the left, the assembly starts to grow on each of the even tiles of the input at the same time, copies each of the even inputs to the second layer, then grows to the left and combines the input from each of the odd tiles in parallel. Then, the tiles on top of the odd input tiles contain both the information of the input tile below and the input tile to the right. This tile can then repeatedly be copied up and left, until it reaches the desired position. The runtime of shifting the block is the distance it moves plus 2. In our algorithm, the length of the shift of the input for the next addition is no longer than the length of the input rectangle, so the runtime is at most O(√n).

Runtime For Addition Block. Figure 36 shows an example of two-number multiplication, which is performed by stacking several addition blocks and shift blocks. Recall that in the O(log n) Average Case, O(√n) Worst Case Addition (Section 5), we achieved a runtime of O(√n) for a two-number addition in 2D. We use the same principle to build the addition block in 3D. Instead of growing to the center of every two rows as in 2D, the addition block gets its result on top of the input layer. For simplicity, we abandon the O(log n) average case addition mechanism here. Therefore, the addition block takes O(√n) time to finish.

In the first logical step, the assembly repeatedly stacks up the rectangular copy blocks and shift blocks to form a vertical column, which is used to deploy the desired numbers to all of the O(n^{1/3}) layers. In this vertical column, each layer contains a copy block, and between each layer there is a shift block, so the runtime for this vertical column is O(√n) × O(n^{1/3}) + O(√n) × O(n^{1/3}) = O(n^{5/6}).
In the second step, which occurs in each layer of the assembly, a north-south row starts to grow from the copy block in the vertical column, which is also the northwest corner of each layer, in order to deploy the desired numbers to all of the O(n^{1/3}) columns by horizontally stacking the copy blocks and shift blocks. This process takes O(n^{1/3}) copy blocks and O(n^{1/3}) shift blocks, so the runtime for this north-south row is O(n^{5/6}). In the third logical step, starting from each copy box in the north-south running row in each layer, columns grow eastward in order to deploy numbers while simultaneously adding them. There will be O(n^{1/3}) additions and O(n^{1/3}) shift blocks, so the runtime for each east-west column is O(n^{5/6}). Then, the last block of the first column in each layer will rotate its information to the south and propagate it to the end of the next column, merging the next result and adding them together, then repeating these moving and adding steps until the last result of the layer is obtained. Finally, the result in the bottom layer rotates up and the results from each layer are added together. The runtime of these moving and adding steps is O(n^{5/6}). Therefore, the total runtime of the multiplication process is O(n^{5/6}).
F.3 Correctness.
Note that the multiplication of two n-bit numbers A and B is the summation of a string of numbers C_1, ..., C_n, where C_k = A · B_k · 2^k. The order in which the numbers are summed is as follows: numbers are summed starting on the first layer from north to south; then, in the same order within each layer, from the bottom layer to the top layer. Within each layer, the west-to-east growing columns sum numbers
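This decomposition can be sanity-checked directly. The sketch below (illustrative grouping code, not the tile construction; bits are 0-indexed here) forms the partial products C_k = A · B_k · 2^k and sums them grouped into layers of roughly ∛n × ∛n columns, confirming that the grouping does not affect the result, since integer addition is associative and commutative.

    # Sketch: summing the partial products C_k = A * B_k * 2^k in a
    # layer-by-layer, column-by-column order still yields A * B.
    def product_by_grouped_partials(A: int, B: int, nbits: int) -> int:
        partials = [((B >> k) & 1) * (A << k) for k in range(nbits)]
        g = max(1, round(nbits ** (1 / 3)))  # ~ cbrt(n) items per column
        total = 0
        for layer in range(0, nbits, g * g):             # one 2D layer
            for col in range(layer, min(layer + g * g, nbits), g):
                total += sum(partials[col:col + g])      # one column's sum
        return total

    A, B = 0b100110, 0b110  # the example of Figure 36: 38 * 6
    assert product_by_grouped_partials(A, B, 6) == A * B == 228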
[Figure 36 tile diagrams omitted: Parts 1-26 of the multiplication assembly sequence; per-tile labels are not reproduced here.]
A: The three newly attached tiles in steps 7, 16, and 25 are attached in counterclockwise order. AA: As described in the section on shifting tiles, the information in the tiles can be shifted simultaneously to its neighbor in 4 steps, and takes 2 layers of tile space beyond the original tiles.
Figure 36: This sequence (Parts 1-26) demonstrates the product of A and B, where A = 100110 and B = 1011∗10. Z denotes the number of layers shown in the figure. The effective bits of B are the bits after and including the bit with the ∗ symbol, so the actual multiplication is 100110 × 110 = 11100100. Parts 1-7 show the first addition block. Part 8 shows that A has been shifted one bit ahead after two layers, and Part 9 shows that B has been shifted in the other direction after another two layers.
[Figure 37 tile diagram omitted.]
Figure 37: There exists a tile set to copy any input (bottom row) in two directions.
[Figure 38 tile diagrams omitted.]
(a) For any input (bottom row), there exists a tile set to shift the input one tile to the left in 4 parallel steps.
(b) For any input (bottom row), there exists a tile set to shift the input n tiles to the left in n+2 parallel steps.
Figure 38: The above two figures show how an input can be bit-shifted.

beginning with the northmost column and ending with the southmost column. The head of each column contains A · 2^i and b_i, where i is the number of columns times ∛n,