University of California Los Angeles
Constraint Manipulation Techniques for Synthesis and Verification of Embedded Systems
A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Computer Science
by
Darko Kirovski
2001
© Copyright by Darko Kirovski 2001
The dissertation of Darko Kirovski is approved.
Songwu Lu
Richard R. Muntz
D. Stott Parker, Jr.
Mani B. Srivastava
Miodrag Potkonjak, Committee Chair
University of California, Los Angeles 2001
To my parents, Verica and Risto . . .
Table of Contents
1 Introduction . . . 1
  1.1 Emerging Embedded Systems . . . 1
  1.2 System-on-Chip Debugging . . . 4
    1.2.1 Cut-Based Functional Debugging . . . 4
    1.2.2 Symbolic Debugging . . . 9
    1.2.3 Error Correction: Constraint-Based Engineering Change . . . 11
  1.3 Intellectual Property Protection and Copyright Enforcement . . . 13
    1.3.1 Watermarking Solutions to Logic Synthesis Optimization Problems . . . 15
    1.3.2 Local Watermarks: Methodology and Application to Behavioral Synthesis . . . 16
    1.3.3 Forensic Engineering of Solutions to Combinatorial Optimization Problems and an Application to Design Automation . . . 18
  1.4 Thesis Organization . . . 20
2 State-of-the-Art in Related Work . . . 21
  2.1 Design Debugging . . . 21
    2.1.1 System Simulation . . . 22
    2.1.2 System Emulation . . . 23
  2.2 Symbolic Debugging . . . 25
  2.3 Engineering Change . . . 27
  2.4 Intellectual Property Protection . . . 28
3 Modern Design Flows . . . 31
  3.1 Design-for-Debugging . . . 37
    3.1.1 Design-for-Debugging for Programmable and Statically-Scheduled Computation Platforms . . . 40
  3.2 Pre-processing for Symbolic Debugging . . . 46
  3.3 Pre- and Post-processing for Engineering Change . . . 47
    3.3.1 Constraint-based Intellectual Property Protection . . . 50
4 Improving the Observability and Controllability of Datapaths for Emulation-based Debugging . . . 53
  4.1 Introduction . . . 54
    4.1.1 Motivational Example . . . 57
    4.1.2 Computation and Hardware Model . . . 59
  4.2 The New Approach: Cut-Based Integrated Debugging . . . 61
  4.3 Synthesis for Debugging . . . 64
    4.3.1 Background Definitions . . . 64
    4.3.2 Cut Selection . . . 67
    4.3.3 Variable Scheduling . . . 78
    4.3.4 Variable-to-Port Scheduling . . . 80
  4.4 Experimental Results . . . 83
  4.5 Conclusion . . . 85
5 Cut-based Functional Debugging for Programmable Systems-on-Chip . . . 88
  5.1 Introduction . . . 89
    5.1.1 Motivational Example . . . 91
  5.2 Preliminaries . . . 95
  5.3 Debugging and Real-Time Cut Export . . . 97
    5.3.1 ASIC Design-for-Debugging . . . 99
    5.3.2 Code Compilation-for-Debugging . . . 99
    5.3.3 Integration-for-Debugging . . . 101
  5.4 Design-for-Debugging: Algorithms . . . 102
    5.4.1 Code Instrumentation for Cut I/O . . . 102
    5.4.2 Cut Selection and Register-to-Port Interconnection . . . 104
    5.4.3 Cut Scheduling of Multiple Cores . . . 107
  5.5 Experimental Results . . . 111
  5.6 Conclusion . . . 114
6 Symbolic Debugging of Optimized Behavioral Specifications for Fast Variable Recovery . . . 115
  6.1 Introduction . . . 116
    6.1.1 Motivational Example . . . 118
  6.2 Computational and Hardware Model . . . 120
  6.3 Design for Symbolic Debugging . . . 122
    6.3.1 Selection of Optimal Golden Cuts . . . 122
  6.4 Experimental Results . . . 126
  6.5 Conclusion . . . 127
7 Non-Intrusive Symbolic Debugging of Optimized Behavioral Specifications . . . 129
  7.1 Introduction . . . 130
    7.1.1 Motivational Example . . . 132
  7.2 Computational and Hardware Model . . . 135
  7.3 Design for Symbolic Debugging . . . 135
    7.3.1 Debugging Optimized Behavioral Specifications . . . 136
    7.3.2 Selection of Optimal Golden Cuts . . . 138
    7.3.3 Discussion of Cut Validity After Applying Transformations . . . 144
  7.4 Experimental Results . . . 145
  7.5 Conclusion . . . 147
8 Engineering Change: Methodology and Application to Behavioral and System Synthesis . . . 148
  8.1 Introduction . . . 149
    8.1.1 Motivational Example . . . 151
  8.2 Preliminaries . . . 154
    8.2.1 Hardware and Computational Model . . . 154
    8.2.2 Targeted Behavioral Synthesis Tasks . . . 155
  8.3 The New EC Methodology . . . 156
  8.4 The EC Algorithms for Behavioral Synthesis . . . 158
    8.4.1 Register Allocation and Binding . . . 159
    8.4.2 Operation Scheduling . . . 165
  8.5 Experimental Results . . . 169
  8.6 Conclusion . . . 171
9 Intellectual Property Protection by Watermarking Combinational Logic Synthesis Solutions . . . 173
  9.1 Introduction . . . 173
  9.2 Watermarking Desiderata . . . 175
  9.3 Watermarking Logic Synthesis Solutions . . . 176
    9.3.1 Gate Ordering . . . 177
    9.3.2 Watermark Encoding and Embedding . . . 180
    9.3.3 Persistence to Possible Attacks . . . 183
  9.4 Experimental Results . . . 185
  9.5 Conclusion . . . 187
10 Local Watermarking: Methodology and Application to Behavioral Synthesis . . . 189
  10.1 Introduction . . . 190
  10.2 Preliminaries . . . 193
    10.2.1 Hardware and Computational Model . . . 193
    10.2.2 Targeted Behavioral Synthesis Tasks . . . 194
  10.3 Global Flow: IPP for Behavioral Synthesis . . . 195
  10.4 IPP Protocols for Behavioral Synthesis . . . 196
    10.4.1 Operation Scheduling . . . 196
    10.4.2 Template Matching . . . 204
  10.5 Experimental Results . . . 209
  10.6 Conclusion . . . 211
11 Forensic Engineering Techniques for VLSI CAD Tools . . . 213
  11.1 Introduction . . . 214
  11.2 Existing Methods for Establishing Copyright Infringement . . . 215
  11.3 Forensic Engineering: The New Generic Approach . . . 217
  11.4 Forensic Engineering: Statistics Collection . . . 219
    11.4.1 Graph Coloring . . . 219
    11.4.2 Boolean Satisfiability . . . 225
  11.5 Forensic Engineering: Algorithm Clustering and Decision Making . . . 228
  11.6 Experimental Results . . . 231
  11.7 Conclusion . . . 233
List of Figures
1.1 Embedded systems: examples of "buzzword-to-reality" applications that have recently become investment targets with large and high-margin profitable businesses. . . . 3
1.2 Key technology trends imposing that difficulty of debugging will likely become more difficult in the future. Pin and gate count are tabulated for processors in past three decades and corresponding trends are presented in the graph. . . . 5
1.3 The new concept of functional debugging. The running design periodically outputs the cut state, which is stored in a database. Any one of these states can be used to initialize, and then continue execution with preserved functional and timing accuracy. . . . 8
3.1 Cut-based debugging: an exemplary process of outputting cut variables of all cores (both programmable and application-specific) in the system through a common bus structure. . . . 41
3.2 A generic system architecture for the developed debugging platform. It consists of individual cores, the embedded software running on these cores, an inter-core bus network, and a set of protocols for core intercommunication. . . . 44
3.3 Global design flow for the developed design for symbolic debugging methodology. . . . 47
3.4 The design flows for two developed engineering change methodologies: design for engineering change and post-processing for engineering change. . . . 49
3.5 The protocol for hiding information in solutions for multi-level logic optimization and technology mapping. . . . 51
4.1 The new concept of functional debugging. The running design periodically outputs the cut state, which is stored in a database. Any one of these states can be used to initialize, and then continue execution with preserved functional and timing accuracy. . . . 55
4.2 Optimal cut example. (a) CDFG and (b) allocated, assigned, and scheduled CDFG for the fifth order CF IIR filter. Subfigure (b) depicts two cuts: C1 = {IN, D1, D2, D3, D4, D5} with dotted edges and C2 = {IN, A2, A4, A6, A8, A10} with bold edges. . . . 57
4.3 An example of a scheduled and assigned control data flow graph and the accompanying definitions. Primary inputs and outputs, state delays, data operations, data precedence edges, register assignment, variable write life-time, and a complete cut example are illustrated. . . . 67
4.4 Pseudo-code for the cut search algorithm. . . . 69
4.5 Construct ISG(CDFG) - pseudo-code for construction of the input sensitive graph. . . . 70
4.6 Input Sensitive Transitive Closure(ISG, CDFG) - pseudo-code for computing the input-sensitive closure of the CDFG. . . . 70
4.7 Example of an input sensitive graph which corresponds to the CDFG shown in Figure 4.3. Each node corresponds to a data operation Ni in the original CDFG and has a set of inputs which correspond to the operands of Ni. The edges in the graph are either inherited from the original CDFG or created using the input-sensitive transitive closure procedure. . . . 71
4.8 Input Sensitive Dominating Set(ISG). . . . 73
4.9 Pseudo-code for Optimal Cut-set for Debugging (II) search. . . . 75
4.10 Pseudo-code for graph compaction. . . . 76
4.11 The unscheduled (a) and scheduled and assigned (b) control data flow of a third order Gray-Markel ladder filter. . . . 77
4.12 Finding the cut-set of the third order Gray-Markel ladder IIR filter. Subfigures (a,b,c) demonstrate the node merger procedure. Subfigures (d,e) illustrate the removal of a node from the set of SCCs and its inclusion in the set of selected cut variables. . . . 78
4.13 Pseudo-code for the cut-set output scheduling heuristic. . . . 80
4.14 Example of output scheduling. . . . 81
4.15 Pseudo-code for the variable-to-port scheduling heuristic. . . . 82
5.1 The 5th order CF IIR filter. Motivational example: system component core scheduled, allocated, and assigned CDFGs. . . . 92
5.2 Motivational example: ASIC architectures for the 5th order CF IIR and 3rd order Gray-Markel ladder filter. . . . 93
5.3 Motivational example: Unfolded CDFG over three consecutive iterations shows how variables D1, D2, and D3 are computed. . . . 94
5.4 The targeted system: individual core architecture, embedded software, and core intercommunication. . . . 96
5.5 View at the process of outputting the cut variables of all cores in the system. . . . 98
5.6 Pseudo-code for PC CUT SELECTION search. . . . 104
5.7 Pseudo-code for the debugging prospective register subset search. . . . 107
5.8 Pseudo-code for the cut selection and scheduling algorithm. . . . 110
6.1 Part of the optimized program without considering debugging. . . . 119
6.2 Part of the optimized program by our proposed design-for-debugging method. . . . 120
6.3 A motivational example for the proposed design-for-debugging method. . . . 121
6.4 The pseudocode of the basic heuristic for the golden cut problem. . . . 125
6.5 Modeling a hyperedge in flow network. . . . 125
6.6 The construction process of a flow network for the "green" subgraph: the flow network. . . . 126
6.7 The pseudocode of the iterative improvement heuristic for the golden cut problem. . . . 126
7.1 An example of trade-offs involved in selection of cut variables such that optimization potential of the computation is not impacted. . . . 133
7.2 Global flow of the DfD and symbolic debugging process. . . . 137
7.3 Pseudo-code for the developed algorithm for The Complete Golden Cut problem. . . . 142
7.4 Performing the steps of a single iteration of the cut-set selection procedure. . . . 143
8.1 Design-for-EC: two resource allocation and scheduling solutions with different resilience to errors. . . . 152
8.2 An example engineering change application: performing graph coloring of a corrected specification only on the updated subgraph. . . . 153
8.3 The design flows for design-for-EC and post-processing for EC. . . . 157
8.4 A second order Gray-Markel ladder filter: CDFG, its scheduling and the corresponding interval graph. . . . 159
8.5 Procedure used to embed edges of type-I into an interval graph. . . . 161
8.6 An example of addition of type-I constraints to the graph coloring problem. . . . 162
8.7 Procedure used to embed edges of type-II into an interval graph. . . . 163
8.8 An example of addition of type-II constraints to the graph coloring problem. . . . 164
8.9 Procedure used to perform the error correction process with minimal hassle. . . . 166
8.10 Post-processing for EC: graph bipartitioning, constraint manipulation, and coloring. . . . 167
8.11 Procedure used to perform the design-for-EC process for operation scheduling solutions. . . . 168
8.12 Constraint augmentation and manipulation for pre- and post-processing for EC of operation scheduling. . . . 169
9.1 The protocol for hiding information in solutions for multi-level logic optimization and technology mapping. . . . 176
9.2 Proposed function for completely defined node ordering. . . . 178
9.3 An example of ordering nodes according to the proposed set of sorting criteria. . . . 180
9.4 Proposed function for watermarking multi-level logic minimization solutions using network augmentation. . . . 181
9.5 A top-down and bottom-up approach to finding attacker's signature in an already watermarked solution. . . . 185
10.1 The global flow of the generic approach for local watermarking behavioral synthesis solutions. . . . 195
10.2 Pseudo-code of the proposed protocol for local watermarking of operation scheduling solutions: Subtree sT identification. . . . 198
10.3 Pseudo-code of the proposed protocol for local watermarking of operation scheduling solutions: Constraint Encoding for Operation Scheduling. . . . 199
10.4 Pseudo-code of the proposed protocol for local watermarking of operation scheduling solutions: CDFG Node Ordering. . . . 200
10.5 An example of local watermarking scheduling solutions: fourth order parallel IIR filter. . . . 204
10.6 Pseudo-code of the proposed protocol for constraint encoding during local watermarking of template matching solutions. . . . 206
10.7 An example of local watermarking template matching solutions: fourth order parallel IIR filter. . . . 207
11.1 Global flow of the forensic engineering methodology. . . . 217
11.2 Example of the DSATUR algorithm. . . . 221
11.3 Example of the RLF algorithm. . . . 222
11.4 Example of two different graph coloring solutions obtained by two algorithms DSATUR and RLF. The index of each vertex specifies the order in which it is colored according to a particular algorithm. . . . 224
11.5 Forensic engineering: pseudo-code for the algorithm clustering procedure. . . . 229
11.6 Two different examples of clustering three distinct algorithms. The first clustering (figure on the left) recognizes substantial similarity between algorithms A1 and A3 and substantial dissimilarity of A2 with respect to A1 and A3. Accordingly, in the second clustering (figure on the right) the algorithm A3 is recognized as similar to both algorithms A1 and A2, which were found to be dissimilar. . . . 230
11.7 Experimental results obtained for forensic engineering of graph coloring and SAT. Figure continues over next 8 pages. Detailed explanation of each figure in the experimental results subsection. . . . 233
List of Tables
4.1 Application of the design-for-debugging step to a set of standard benchmarks for estimation of hardware overhead (according to the first definition of a cut). . . . 86
4.2 Application of the design-for-debugging step to a set of standard benchmarks for estimation of hardware overhead (according to the second definition of a cut). . . . 87
5.1 Table of cuts for the 3rd order Gray-Markel ladder filter. . . . 108
5.2 Debug information for the implementation of a number of ASICs. . . . 112
5.3 The efficiency of the cut selection and scheduling approach is tested on a set of ASIC core-mixes. . . . 113
5.4 Total number of variables and cut cardinalities for a set of multimedia benchmarks. . . . 114
6.1 Golden Cut Sizes 1, 2, and 3 are obtained for values of k in the linear program, such that the final query time is 0.5, 0.25, and 0.125 of initial query time, respectively. . . . 128
6.2 Golden Cut Sizes 1, 2, and 3 are obtained for the value k in the linear program, such that the final query time is 0.5, 0.25, and 0.125 of initial query time, respectively. . . . 128
7.1 Comparison of areas of designs optimized with and without the DfD phase. ICP - initial critical path; OCP - critical path after optimization; GC - cardinality of the complete golden cut; IArea - optimized design area without DfD; OArea - optimized design area with DfD; Area OH - the overhead in area incurred due to pre-processing for symbolic debugging. . . . 146
8.1 Engineering change experimental results: overhead of performing modifications on register allocation instances using design-for-EC and EC, only EC, and complete resynthesis. . . . 170
8.2 Engineering change experimental results: overhead of performing modifications on operation scheduling instances using design-for-EC and EC, only EC, and complete resynthesis. . . . 171
9.1 Watermarking technology mapping solutions for the MCNC suite. Columns present, respectively: name of the circuit, number of primary outputs, number of non-primary gates in the project description, and the solution quality (number of LUTs) when algorithm CutMap [Con96a] is applied to the original design. Each three-column subtable contains a column describing the number of LUTs in the watermarked solution, the hardware overhead with respect to the non-watermarked solution, and the likelihood that some other algorithm retrieves a solution which also contains the watermark. . . . 186
9.2 Experimental results: watermarking LUT-based technology mapping solutions for a set of one small and five industrial designs. First four columns correspond to the columns in Table 9.1. Next, there are five subtables with structure identical to the subtables in Table 9.1. . . . 188
10.1 Experimental results describing the efficiency of applied local watermarking protocols to operation scheduling. . . . 211
10.2 Experimental results describing the efficiency of applied local watermarking protocols to template matching. . . . 212
Acknowledgments
I would like to thank my advisor, Prof. Miodrag Potkonjak, for his never-ending support, the infinite amount of energy invested in my education, and all the knowledge conveyed in the past several years. He has been the ultimate guide through academics, a role model for research, and a friend to rely upon during hard times.

As my professional career continues with Microsoft Research, I would like to thank Rico Malvar and Yacov Yacobi for their support and the great lessons learned during my internships at Microsoft Research. Working with them attracted me to join the research center that has financially supported my graduate studies through the prestigious Microsoft Research Graduate Fellowship.

The time I have spent in the Computer Science Department at the University of California, Los Angeles has been marvellous, partly because of the great faculty that has helped me shape my technical background. I have had the great honor and delight to learn from and work with Prof. Milos Ercegovac, Prof. Richard R. Muntz, Prof. Jason Cong, and Prof. Andrew B. Kahng. I would also like to thank the department faculty who served on my thesis committee: Songwu Lu and D. Stott Parker, Jr. During my studies I have enjoyed working on several projects with two faculty members from the Department of Electrical Engineering at our University: Prof. William Mangione-Smith and Prof. Mani B. Srivastava.

My academic life at UCLA was enriched through friendships with my peers, friends who were always there to help and collaborate. My thanks go to Gang Qu, my officemate and best buddy at UCLA; Inki Hong and Chunho Lee, also members of Miodrag's group; and Milenko Drinic, Farinaz Koushanfar, David Liu, Seapahn Meguerdichian, and Jennifer Wong, younger colleagues with whom I have had the honor to work on several successful projects. Miodrag's group was always open for collaboration with other research groups from our department, so I enjoyed working with Stefanus Mantik, Yean-Yao Hwang, and George Mustafa. Finally, my special thanks go to Mrs. Verra Morgan, our graduate admissions officer, who has always been with me in the hard days of being a graduate student. Her cheerful personality has brightened many stressful days at UCLA.

My academic life would not have been as successful without the immense support and sacrifice of my loved ones, my parents Risto and Verica and my girlfriend Sanja Trklja. Their forgiveness for the time spent working on my degree and their encouragement and support for my dedication have been a huge emotional impetus for me to prevail in my efforts.
Vita
1970
Born, Belgrade, Yugoslavia.
1992
Research Assistant, Department of Computer Science and Automation, University of Campinas, Brazil.
1995
Bachelor of Science degree in Electrical Engineering and Computer Science, University of Belgrade, Yugoslavia.
1995–1997
Teaching Assistant, Computer Science Department, UCLA. Taught sections of CS152B Computer Design and Interfacing Laboratory under the direction of Prof. David A. Rennels. The course covered selected topics in the design and implementation of computer I/O interfaces and device controllers.
1997
Master of Science degree in Computer Science, University of California, Los Angeles. Thesis title: “Graph Coloring: Iterative Improvement Algorithm and Applications”. Thesis advisor: Prof. Miodrag Potkonjak
1997
Research Assistant, Computer Science Department, University of California, Los Angeles.
1997
Internship with the Advanced VLSI Group in Conexant Systems. Worked on VHDL - C interfaced co-simulation under the supervision of Lisa M. Guerra.
1998
Research Assistant, Computer Science Department, University of California, Los Angeles.
1998
Internship with the Advanced VLSI Group in Conexant Systems. Worked on combined simulation and emulation environments for verification of systems-on-chip.
1998
Awarded the Microsoft Graduate Fellowship for 1998/1999.
1999
Research Assistant, Computer Science Department, University of California, Los Angeles.
1999
Two internships with the Cryptography Group at Microsoft Research. Worked on screening technologies for audio content.
1999
Awarded the ACM-IEEE Design Automation Conference Graduate Fellowship for 1999/2000.
1999
Awarded the Microsoft Graduate Fellowship for 1999/2000.
1999
Worked as a consultant for Microsoft Corp.
2000
Research Assistant, Computer Science Department, University of California, Los Angeles.
2000
Joined Microsoft Research as a researcher.
Publications
D. Kirovski, M. Potkonjak, and L. M. Guerra. Cut-based Debugging for Programmable Systems-on-chip. IEEE Transactions on VLSI Systems, Vol.8, (no.1), pp.40-51, 2000.
I. Hong, D. Kirovski, G. Qu, M. Potkonjak, and M. B. Srivastava. Power Optimization of Variable Voltage Core-Based Systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol.18, (no.12), pp.1702-14, 1999.
D. Kirovski, J. Kin, and W. Mangione-Smith. Procedure based program compression. International Journal of Parallel Programming, Vol.27, (no.6), pp.457-75, 1999.
D. Kirovski, M. Potkonjak, and L. M. Guerra. Improving the Observability and Controllability of Datapaths for Emulation-based Debugging. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol.18, (no.11), pp.1529-41, 1999.
D. Kirovski, C. Lee, M. Potkonjak, and W. Mangione-Smith. Application-Driven Synthesis of Core-Based Systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol.18, (no.9), pp.1316-26, 1999.
I. Hong, D. Kirovski, K. Cornegay, and M. Potkonjak. High-Level Synthesis Techniques for Functional Test Pattern Execution. Integration, The VLSI Journal, Elsevier, Vol.25, (no.2), pp.161-80, 1998.
D. Kirovski, I. Hong, and M. Potkonjak. High Level Synthesis. Computer Aided Design of Integrated Circuits, Wiley Encyclopedia of Electrical and Electronics Engineering, John Wiley and Sons, Inc., 2000.
M. Drinic, D. Kirovski, S. Meguerdichian, M. Potkonjak. Latency-Guided On-Chip Bus Network Design, ACM-IEEE International Conference on Computer-Aided Design, 2000.
D. Kirovski, F. Koushanfar, and M. Potkonjak. Symbolic Debugging of Optimized Behavioral Specifications. ACM-IEEE International Conference on Computer-Aided Design, 2000.
D. Kirovski, D. Liu, J. Wong, and M. Potkonjak. Forensic Engineering Techniques for Graph Coloring and Boolean Satisfiability. Design Automation Conference, pp.581-6, 2000.
D. Kirovski and M. Potkonjak. Localized Watermarking: Methodology and Application to Template Matching. ACM-IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol.6, pp.3235-8, 2000.
D. Kirovski, I. Hong, M. Potkonjak, and M. C. Papaefthymiou. Symbolic Debugging of Globally Optimized Behavioral Specifications. Asia South-Pacific Design Automation Conference, pp.397-400, 2000.
D. Kirovski and M. Potkonjak. Localized Watermarking: Methodology and Application to Operation Scheduling. ACM-IEEE International Conference on Computer-Aided Design, pp.596-9, 1999.
A. B. Kahng, D. Kirovski, S. Mantik, and M. Potkonjak. Copy Detection Techniques for VLSI Circuits. ACM-IEEE International Conference on Computer-Aided Design, pp.600-3, 1999.
D. Kirovski and M. Potkonjak. Engineering Change: Methodology and Applications to Behavioral and System Synthesis. ACM-IEEE Design Automation Conference, pp.604-10, 1999.
D. Kirovski, M. Ercegovac, and M. Potkonjak. Low-Power Behavioral Synthesis Optimization Using Multiple-Precision Arithmetic. ACM-IEEE Design Automation Conference, pp.568-73, 1999.
G. Qu, D. Kirovski, M. Potkonjak, and M. B. Srivastava. Energy Minimization of System Pipelines Using Multiple Voltages. IEEE International Symposium on Circuits and Systems VLSI, Vol.1, pp.362-5,1999.
M. Potkonjak and D. Kirovski. Engineering Change Protocols for Behavioral Synthesis. ACM-IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol.4, pp.1993-6,1999.
D. Kirovski and M. Potkonjak. Synthesis of DSP Soft Real-Time Multiprocessor Systems-on-Silicon. ACM-IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol.4, pp.1901-4, 1999.
D. Kirovski, L. M. Guerra, and M. Potkonjak. Functional Debugging of Systems on Silicon. ACM-IEEE International Conference on Computer-Aided Design, pp.525-8, 1998.
D. Kirovski, Y.-Y. Hwang, M. Potkonjak, and J. Cong. Intellectual Property Protection by Watermarking Combinational Logic Synthesis Solutions. ACM-IEEE International Conference on Computer-Aided Design, pp.194-8, 1998.
D. Kirovski and M. Potkonjak. Efficient Coloring of a Large Spectrum of Graphs. ACM-IEEE Design Automation Conference, pp.427-32, 1998.
I. Hong, D. Kirovski, G. Qu, M. Potkonjak, and M. Srivastava. Power Optimization of Variable Voltage Core-Based Systems. ACM-IEEE Design Automation Conference, pp.176-81, 1998.
M. Ercegovac, D. Kirovski, G. Mustafa, and M. Potkonjak. Behavioral Synthesis Optimization using Multiple-Precision Arithmetic. ACM-IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol.5, pp.3113-16, 1998.
D. Kirovski, C. Lee, M. Potkonjak, and W. Mangione-Smith. Synthesis of PowerEfficient Systems-on-Silicon. IEEE Asia South Pacific Design Automation Conference, pp.557-62, 1998.
D. Kirovski, J. Kin, and W. Mangione-Smith. Procedure Based Program Compression. Thirtieth Annual IEEE/ACM International Symposium on Microarchitecture, pp.204-13, 1997.
Abstract of the Dissertation
Constraint Manipulation Techniques for Synthesis and Verification of Embedded Systems
by
Darko Kirovski
Doctor of Philosophy in Computer Science
University of California, Los Angeles, 2001
Professor Miodrag Potkonjak, Chair
Recently, the industry has adopted the programmable system-on-chip platform as the architecture of choice for meeting the performance and feature demands posed by many multi-application, user-customizable embedded systems. Due to increased design complexities, shortened time-to-market, and the high sensitivity of a product's economic success to R&D costs, system designers have adopted design reuse as a standard hardware and software design paradigm. In this work, we present several design methodologies that use constraint manipulation as a means of (i) enabling effective debugging of core-based systems-on-chip as well as (ii) enforcing protection of intellectual property (IP) from fraud and misappropriation. To facilitate system debugging, we introduce a cut-based functional debugging paradigm that leverages the advantages of both emulation and simulation. The approach enables the user to run test sequences in emulation for fast execution and, upon error detection, roll back to an arbitrary instant in execution time and transparently switch to simulation-based debugging for full design visibility and controllability. We have developed a symbolic debugger, a tool that enables designers to interact with the executing code at the source level. When a behavioral specification is optimized using transformations, values of source code variables may be either inaccessible at run-time or inconsistent with what the designer expects. In response to a query, the symbolic debugger retrieves and displays values of source variables in a manner consistent with the source statement where execution has halted. To facilitate error correction, we have developed a set of engineering change tools, i.e., constraint manipulation algorithms which perform functional or timing modifications on the design while minimally altering its original optimized specification. Since business models based on intellectual property are vulnerable to a number of potentially devastating obstructions such as misappropriation and IP fraud, we have developed two techniques for protecting copyright. The first technique, design watermarking, uses constraint manipulation to embed a secret message in the optimized design solution. The second technique, forensic analysis, analyzes whether a solution to an optimization problem could have been created using a particular optimization algorithm.
CHAPTER 1
Introduction
1.1 Emerging Embedded Systems
With the convergence of technologies such as the Internet, semiconductors, fiber optics, sensor networks, and wireless access, transport, and communication protocols, many applications considered "buzzwords" a few years ago have become a reality. Most of these applications are embedded systems, computing engines with an application-specific purpose. In recent years, many of these systems have grown into multibillion dollar businesses or have been labeled as expected high-margin profit enablers. Companies such as Cisco and Juniper, for example, have already established the Internet switch as one of the most profitable products. Emerging applications include data centers, Bluetooth, Internet appliances, pocket PCs, home networking, wireless LANs, and multimedia content distribution (see Figure 1.1). Although these systems target quite different applications at their core, a common set of properties characterizes all of them. In order to support a vast variety of functionalities, commonly high bandwidth, multimedia capabilities, low latency, and low energy consumption, almost all modern embedded systems face high system complexity. The nature of the communication protocols also imposes a myriad of hard and soft real-time constraints, making system development difficult because validating
whether such constraints are satisfied is exceptionally hard. As most embedded systems are poised for ubiquitous variants, reducing their power consumption is crucial to compensate for short battery life. The financial success of a product is commonly not determined by raw performance alone, as short time-to-market and the cost effectiveness of the system development process play an important role in the margins that the product amasses. Most of these systems, especially the ones based on hardware implementations, suffer from sensitivity to standard evolution that commonly happens even during the design process. Being able to respond to specification changes quickly and effectively is therefore an important aspect of embedded systems design tools. Finally, design reuse, a crucial technology for coping with design complexity, has brought another devastating problem to the industry: protecting intellectual property and enforcing copyright. In summary, modern embedded systems are rarely competitive without being able to access the Internet, without offering scalable and secure services such as e-commerce, and without competitive design and production costs. In this work, we present several design methodologies that use constraint manipulation as a means of (i) enabling effective debugging of core-based systems-on-chip as well as (ii) enforcing protection of intellectual property (IP) from fraud and misappropriation. To facilitate system debugging, we introduce a cut-based functional debugging paradigm that leverages the advantages of both emulation and simulation. The approach enables the user to run test sequences in emulation for fast execution and, upon error detection, roll back to an arbitrary instant in execution time and transparently switch to simulation-based debugging for full design visibility and controllability. We have developed a symbolic debugger, a tool that
enables designers to interact with the executing code at the source level. When a behavioral specification is optimized using transformations, values of source code variables may be either inaccessible at run-time or inconsistent with what the designer expects. In response to a query, the symbolic debugger retrieves and displays values of source variables in a manner consistent with the source statement where execution has halted. To facilitate error correction, we have developed a set of engineering change tools, i.e., constraint manipulation algorithms which perform functional or timing modifications on the design while minimally altering its original optimized specification. Since business models based on intellectual property are vulnerable to a number of potentially devastating obstructions such as misappropriation and IP fraud,
we have developed two techniques for protecting copyright. The first technique, design watermarking, uses constraint manipulation to embed a secret message in the optimized design solution. The second technique, forensic analysis, analyzes whether a solution to an optimization problem could have been created using a particular optimization algorithm.

Figure 1.1: Embedded systems: examples of "buzzword-to-reality" applications that have recently become investment targets with large and high-margin profitable businesses.
1.2 System-on-Chip Debugging
1.2.1 Cut-Based Functional Debugging
The key technological and application trends, mainly related to increasingly reduced design observability and controllability, indicate that the cost and time expenses of debugging follow sharply ascending trajectories. The two most directly related factors are the rapid growth in the number of transistors per pin and the increased level of hardware sharing (trends and data presented in Figure 1.2). The analysis of physical data for state-of-the-art microprocessors (according to The Microprocessor Report) indicates that in less than two years (from late 1994 to mid 1996) the number of transistors per pin increased by more than a factor of two, from slightly more than 7,000 to 14,100 transistors per pin. At the same time, the size of an average embedded or DSP application has been approximately doubling each year, the time to market has been getting shorter for each new product generation, and there has been a strong market need for user customization of application-specific systems. Together, these factors have resulted in shorter available debugging time for increasingly complex designs. Finally, design and CAD trends that additionally emphasize the importance of debugging include design reuse, the introduction of a system software layer, and the increased importance of collaborative design. These factors result in increasingly intricate
functional errors, often due to the interaction of parts of the design written by several designers.
Figure 1.2: Key technology trends indicating that debugging will likely become more difficult in the future. Pin and gate counts are tabulated for processors of the past three decades and the corresponding trends are presented in the graph.

Such technology and design trends indicate that, as the complexity of designs increases, functional verification emerges as a dominant step with respect to time and cost in the development of a system-on-chip (SOC). For example, the UltraSPARC-I design team reported that debugging efforts took
twice as long as their design activities [Yan95]. Similarly, the designers of a modern superscalar microprocessor reported that the debugging process took more than 40% of the development time [Uch94]. The difficulty of verifying designs is likely to worsen in the future. The Intel development strategy team foresees that a major design concern for year-2006 microprocessor designs will be the need to exhaustively test all possible compatibility combinations [Yu96]. The same team also states that the circuitry in their future designs devoted to debugging purposes is estimated to increase sharply to 6% from the current 3% of the total die area. The two most important components for efficient functional and timing verification are speed of functional execution and design controllability and observability. Traditional approaches, such as design emulation and simulation, are becoming increasingly inefficient at addressing system debugging needs. Design emulation - implemented on arrays of rapid prototyping modules (FPGAs) or specialized hardware - is fast, but due to strict pin limitations provides limited and cumbersome design controllability and observability. Simulation - a software model of the design at an arbitrary level of accuracy - has the required controllability and observability, but is, depending on the modeling accuracy, two to ten orders of magnitude slower than emulation [Ziv96]. For example, the functional verification team for the new HP PA8000 processor reported a speed difference of nearly six orders of magnitude between the RT-level simulated (0.5 Hz on a workstation) and FPGA-emulated (300 kHz) functional execution of their PA8000-based 200 MHz workstation system [Man97]. The novel ideas proposed in this work advocate the development of a new paradigm for debugging and design-for-debugging of systems-on-chip. The new debugging technique integrates design emulation and simulation in a way that
the advantages of the two are combined, while the disadvantages are eliminated. The functional debugging process, depicted in Figure 1.3, includes four standard debugging procedures: test input generation and execution, error detection, error diagnosis, and error correction. Long test sequences are run in emulation. Upon error detection, the computation is migrated to the simulation tool for full design visibility and controllability. To explain how execution is transferred from one execution domain to another, we introduce the notion of a complete cut. A complete cut is a subset of variables that fully determines the design state at an arbitrary time instance. The ability to read/write the state of a particular cut from/to the design is enabled by inserting register-to-port interconnects and appropriate scheduling statements into the initial design specification. The design techniques developed to enable migration of the execution are applied as a design post-processing step and, thus, can be used in conjunction with existing or future synthesis systems or manual design approaches. The running design (simulation or emulation) periodically outputs the cut state. These states are saved by a monitoring workstation. When a transition to the alternate domain is desired, any one of the previously saved states can be used to initialize, and then continue execution in simulation or emulation with preserved functional and timing accuracy. Once the error is localized and characterized in the error diagnosis step, the emulator is updated or built-in fault tolerance mechanisms are activated. Since current trends in the semiconductor industry show that programmable SOCs are becoming the dominant design paradigm, providing adequate verification tools for such systems is a premier engineering task. We have developed a generalized cut-based methodology for coordinated simulation and emulation of SOCs consisting of a system of programmable and application-specific cores.
Figure 1.3: The new concept of functional debugging. The running design periodically outputs the cut state, which is stored in a database. Any one of these states can be used to initialize, and then continue execution with preserved functional and timing accuracy.

The methodology introduces a number of optimization problems and a need for efficient implementation mechanisms. We provide a set of tools that solve these problems for a mixed SDF-SISRAM model of computation. This computation model is frequently used in many communications, multimedia, and DSP applications. We propose a suite of algorithms that effectively identifies the minimal computation state (cut) and post-processes the system components to enable I/O of cut variables. The experiments, conducted on a set of standard multi-core benchmarks and industry-strength designs, quantify the overhead induced to enable the developed debugging paradigm. In all cases, no or negligible hardware and performance overhead was incurred while providing both fast functional execution and full design controllability and observability.
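To make the cut-export-and-restore idea concrete, the following is a minimal sketch in Python (illustrative only, not the tooling developed in this dissertation; the class and function names are assumptions) in which an emulated design periodically exports its complete-cut state and any saved cut later seeds a simulation run so execution resumes with the same functional state.

```python
# Hypothetical sketch of cut-based state export and restore.
# A "complete cut" is a subset of variables that fully determines
# the design state at a given iteration boundary.

class CutCheckpointStore:
    """Stores cut states exported by the emulated design."""

    def __init__(self):
        self._checkpoints = {}          # iteration -> {cut variable: value}

    def export_cut(self, iteration, cut_values):
        # Called periodically by the running (emulated) design.
        self._checkpoints[iteration] = dict(cut_values)

    def restore(self, iteration):
        # Returns the cut state used to initialize the simulator.
        return dict(self._checkpoints[iteration])


def resume_in_simulation(simulator_state, cut_values):
    """Initialize a simulation run from a saved cut.

    Non-cut variables are recomputed by the simulator itself, since the
    cut determines the design state for the following iterations.
    """
    simulator_state.update(cut_values)
    return simulator_state


if __name__ == "__main__":
    store = CutCheckpointStore()
    # The emulation loop exports the cut every iteration (toy values).
    for it in range(5):
        store.export_cut(it, {"IN": it, "D1": it * 2, "D2": it * 3})
    # An error is detected after iteration 4; roll back to iteration 2
    # and continue under simulation with full observability.
    sim_state = resume_in_simulation({}, store.restore(2))
    print(sim_state)
```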
1.2.2 Symbolic Debugging
Symbolic debuggers are system development tools that can accelerate the validation of behavioral specifications by allowing a user to interact with the executing code at the source level [Hen82]. Symbolic debugging must ensure that, in response to a user inquiry, the debugger is able to retrieve and display the value of a source variable in a manner consistent with what the user expects with respect to a breakpoint in the source code. The application of code optimization techniques usually makes symbolic debugging harder. While code optimization techniques such as transformations must have the property that the optimized code is functionally equivalent to the unoptimized code, they may produce a different execution sequence from the source statements and alter the intermediate results. Debugging unoptimized rather than optimized code is not acceptable for several reasons:
• an error that is undetectable in the unoptimized code may be detectable in the optimized code,
• optimizations may be necessary to execute a program due to memory limitations or other constraints imposed on an embedded system, and
• a symbolic debugger for optimized code is often the only tool for finding errors in an optimization tool.
In this work, we propose a design-for-debugging (DfD) approach that enables retrieval of source values for a globally optimized behavioral specification. The goal of the DfD technique is to modify the original code in a pre-synthesis step such that every variable of the source code is controllable and observable in the optimized program. More formally, given a source behavioral specification
(represented as a control data flow graph [Rab91]) CDFG, the goal of the DfD approach is to enforce computation of a selected subset Vcut ⊆ CDFG (cut) of user variables such that:
• all other variables V ∈ CDFG can be computed from the cut Vcut (therefore Vcut represents a cut of the computation [Kir99]),
• the enforcement of computation of the user-defined Vcut variables has minimal impact on the optimization potential of the computation, i.e., the original code can still be optimized with respect to target design metrics such as throughput, area, and power consumption, and
• the computation of non-cut variables from cut variables requires executing a minimal number of operations.
It is important to stress that finding a cut of a computation has been addressed in many debugging [Kir99] and software checkpointing [Ziv98] research works. However, symbolic debugging imposes a new constraint on the cut selection procedure: variables enforced to be computed should not qualitatively restrict the optimization process. The developed DfD techniques analyze the source computation and select the cut variables according to a number of heuristic policies. Each policy quantifies the likelihood that a particular variable is not computed due to a specific transformation of the computation [Hon97]. In order to support fully modular pre-processing, explicit computation of the selected cut variables Vcut is enforced by assigning each variable vi ∈ Vcut to a primary output. Thus, application of any synthesis tool results in an optimized behavioral specification CDFGo which necessarily contains the selected cut variables Vcut. At debugging time (simulation or emulation), the symbolic debugger monitors the values of cut variables. In response to a user
inquiry about a source variable vi that is not available in Vcut or CDFGo but appears in the source CDFG, all the cut variables that vi depends on are determined by a breadth-first search of the source CDFG with reversed arcs. Using these values, vi is computed using the statements from the original CDFG. The developed symbolic debugging technique poses a number of optimization tasks. We define these tasks, establish their complexity, and propose heuristic techniques for their solution. The effectiveness of the developed DfD methodology has been tested using several benchmark designs.
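As an illustration of the variable-recovery step just described, the following sketch (illustrative Python; the CDFG representation and function names are assumptions, not the dissertation's implementation) performs a breadth-first search over reversed CDFG arcs to collect the cut variables a queried source variable depends on; those values can then be used to replay the original statements and recover the queried value.

```python
from collections import deque

def cut_support(cdfg_deps, cut, query_var):
    """Return the cut variables that `query_var` depends on.

    cdfg_deps maps each variable to the variables it is computed from,
    so following the map corresponds to traversing the CDFG with
    reversed arcs, from the query toward the primary inputs.
    """
    support, visited = set(), {query_var}
    frontier = deque([query_var])
    while frontier:
        v = frontier.popleft()
        for operand in cdfg_deps.get(v, ()):
            if operand in cut:
                support.add(operand)       # value is available at run time
            elif operand not in visited:
                visited.add(operand)
                frontier.append(operand)   # keep searching toward the cut
    return support

# Toy CDFG: t = a + b; u = t * c; query u with cut = {a, b, c}.
deps = {"t": ["a", "b"], "u": ["t", "c"]}
print(cut_support(deps, {"a", "b", "c"}, "u"))   # {'a', 'b', 'c'}
```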
1.2.3 Error Correction: Constraint-Based Engineering Change
Due to the increasing complexity of modern systems-on-chip and more segmented design flows [Sha95], engineering change (EC) has recently emerged as a key enabling technology for shortening the time-to-market. The applications of EC range from error correction in system debugging and performance tuning to adaptation to new functionalities and standards, and even low-power design [Buc97]. The fundamental goal of any set of EC tools is to provide the designer with the ability to easily perform functional or timing changes on the design while minimally altering its specification throughout all levels of abstraction. In the case of RTL or logic network descriptions, a small change in the specification may result in significant perturbations of the underlying optimized structures (e.g., layout) [Fan97]. These consequences are highly undesirable because, in the case of fabricated circuits, modifications are performed using:
• Mask updating and refabrication.
• Spare logic: the designer stores spare logic in unused portions of the chip. If an error is detected, using a Focused Ion Beam (FIB) [Tho68] apparatus
for cutting and implanting new wires on a die, this logic can be utilized for error correction. Similar effects can be achieved by allocating memory cells to multiplex a set of wires allocated for EC.
• Electron beam lithography: the FIB apparatus can be combined with electron beam lithography (EBL) to create a complete system for rewiring and implanting logic structures into an already fabricated design [Tho68].
There are two fundamental approaches to EC: design-for-EC, where a certain amount of logic or programmable interconnect with no effect on the functionality and timing constraints is added to the design before compilation; and post-processing, where, knowing the correct functionality of the design, the optimized design is minimally altered such that the error is corrected. While the goal of the first technique is to anticipate which extra hardware might be useful in the case of an alteration, the second one has the difficult task of using a limited amount of resources to update the optimized design with minimal hassle. Although a number of techniques which address EC have been developed, until now these efforts have been mainly ad-hoc and unrelated to the design process. We introduce a new design methodology which facilitates both design-for-EC and post-processing to enable EC with near-minimal perturbation. Initially, as a synthesis pre-processing step, the original design specification is augmented with additional design constraints which ensure flexibility for future alteration. After the optimization algorithm is applied to the modified input, the added constraints impose a set of additional functionalities that the design can also perform. Upon diagnosis of an alteration in the initial design, a novel post-processing technique, which also relies on constraint manipulation, achieves the desired functionality with a near-minimal perturbation of the optimized design. The key contribution which we introduce is a generalized constraint manipulation technique which
enables the reduction of an arbitrary EC problem to its corresponding classical synthesis problem. As a result, in both design-for-EC and post-processing, classical synthesis algorithms can be used to enable flexibility and then perform the correction process. This contrasts with the currently adopted research model for EC problems, which seeks new synthesis solutions. The problem of EC has initiated research activity mainly in the logic synthesis domain. However, due to the increasing complexity of behavioral specifications and the increasing number of stages in the current "golden reference" [Gat94] and waterfall [Sha95] design flow models, designers are commonly faced with modifications which span a number of design stages. In order to provide connectivity for EC through the entire design process, we demonstrate the developed EC methodology on a set of behavioral and system synthesis tasks. It is important to stress that all developed EC techniques can be applied to synthesis problems at all levels of design abstraction (e.g., logic synthesis, layout).
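As a rough illustration of reducing an EC problem to a classical synthesis problem, the sketch below (hypothetical Python; the helper name and the greedy coloring choice are assumptions, not the algorithms of Chapter 8) recolors only the updated portion of a register-conflict graph while the colors of untouched vertices act as fixed constraints on the classical coloring step.

```python
def recolor_updated_subgraph(graph, coloring, updated):
    """Engineering change modeled as constrained graph coloring.

    graph    : {vertex: set(neighbors)}  register-conflict graph
    coloring : {vertex: color}           existing optimized solution
    updated  : set of vertices affected by the specification change
    Only vertices in `updated` may change color; all others act as
    additional constraints on the classical coloring step.
    """
    new_coloring = {v: c for v, c in coloring.items() if v not in updated}
    # Most-constrained-first greedy order over the updated subgraph.
    for v in sorted(updated, key=lambda u: -len(graph[u])):
        taken = {new_coloring[n] for n in graph[v] if n in new_coloring}
        color = 0
        while color in taken:
            color += 1
        new_coloring[v] = color
    return new_coloring

# Toy example: vertex "x" gains a new conflict with "a" after the change.
g = {"a": {"b", "x"}, "b": {"a"}, "x": {"a"}}
old = {"a": 0, "b": 1, "x": 0}
print(recolor_updated_subgraph(g, old, {"x"}))   # {'a': 0, 'b': 1, 'x': 1}
```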
1.3 Intellectual Property Protection and Copyright Enforcement
The complexity of modern electronics synthesis, as well as shortened time-to-market, has resulted in design reuse as a predominant system development paradigm. The new core development strategies have affected the business model of virtually all VLSI CAD and semiconductor companies. For example, a number of companies have recently consolidated their efforts towards developing off-the-shelf programmable or application-specific cores (e.g., ARM, LSI Logic, Design-and-Reuse). It has been estimated that more than half of all ASICs in the year 2000 will contain at least one core [Tuc97]. To rapidly overcome the difficulties in
core-based system design, the Virtual Socket Interface Alliance has identified six technologies crucial for enabling effective design reuse: system verification, mixed signal design integration, standardized on-chip bus, manufacturing related test, system-level design, and intellectual property protection (IPP) [VSI97]. The recently proposed Strawman initiative [VSI97] of the Development Working Group on IPP calls for the following desiderata for techniques which act as deterrents and properly ensure the rights of the original designers.

• Functionality Preservation. Design-specific functional and timing requirements should not be altered by the application of IPP tools.

• Minimal Hassle. The technique should be fully transparent to the already complex design and verification process.

• Minimal Cost. Both the cost of applying the protection technique and its hardware overhead should be as low as possible.

• Enforceability. The technique should provide strong and undeniable proof of authorship.

• Flexibility. The technique should enable a spectrum of protection levels which correspond to variable cost overheads.

• Persistence. The removal of the watermark should result in a task of difficulty equal to the complete redesign of the specified functionality.

In addition to the stated VSI intellectual property protection requirements, our approach also provides proportional protection of all parts of the design.
1.3.1 Watermarking Solutions to Logic Synthesis Optimization Problems
We have developed the first approach for IPP which facilitates design watermarking at the combinational logic synthesis level. The watermark, designer- and/or tool-specific information, is embedded into the logic network of a design in a pre-processing step. The watermark is encoded as a set of design constraints which do not exist in the original specification. The constraints are uniquely dependent upon the author's signature. Upon imposing these constraints on the original logic network, a new input is generated which has the same functionality and contains user-specific information. The added constraints result in a trade-off: the more additional constraints, the stronger the proof of authorship, but the higher the overhead in terms of the quality of the synthesis solution. In either case, the application of the synthesis algorithm results in a solution which satisfies both the original and the constrained input. Proof of authorship is based upon the fact that the likelihood that another application returns a solution to both the original and constrained input is exceptionally small. The developed watermarking technique is transparent to the synthesis step and can be used with any logic synthesis tool (a small sketch of the signature-to-constraints encoding is given after the list below). We demonstrate that the developed IPP approach can be used to:

• Prove authorship of the design at levels of abstraction equal to or lower than logic synthesis. Existence of a user-specific signature in the solution of a multi-level optimization or technology mapping problem clearly identifies the author of the input design specification (initial input logic network).

• Protect the synthesis tool. The signature of the tool developer, embedded in logic synthesis solutions, clearly indicates the origin of the synthesis tool.
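A minimal sketch of how a signature could be turned into a reproducible set of extra constraints is shown below. The hashing and constraint-pairing scheme, as well as all names, are illustrative assumptions rather than the exact protocol of this thesis; the essential point is that the constraint set is a deterministic function of the author's signature, so it can be re-derived during authorship verification.

    # Illustrative sketch: derive an author-specific, reproducible set of extra
    # pairwise constraints over named design objects from a signature string.
    # The SHA-256/PRNG construction is an assumption, not the thesis protocol.
    import hashlib
    import random

    def signature_to_constraints(signature, objects, n_constraints):
        seed = int.from_bytes(hashlib.sha256(signature.encode()).digest()[:8], "big")
        rng = random.Random(seed)
        constraints = set()
        while len(constraints) < n_constraints:
            u, v = rng.sample(objects, 2)
            constraints.add(tuple(sorted((u, v))))
        return sorted(constraints)

    # The same signature always yields the same constraints, so an arbiter can
    # regenerate them and check that the synthesized solution satisfies them.
    print(signature_to_constraints("author-signature", [f"n{i}" for i in range(64)], 8))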
1.3.2 Local Watermarks: Methodology and Application to Behavioral Synthesis
Recently a number of techniques have been proposed for IPP of designs and tools at various design levels: design partitioning [Wol98], physical layout [Kah98], combinational logic synthesis [Kir98b, Lac98], behavioral synthesis [Qu98, Hon98], and design-for-test [Kir98d]. All these techniques facilitate augmentation of the user's digital signature, encoded as a set of additional design constraints, into the original design specification in a pre-processing step before the application of the optimization algorithm. The additional design constraints are spread over the entire design specification, thus providing proportional protection for the entire design. The solution retrieved by the optimization algorithm satisfies both the original and user-specific constraints. This property is the key to ensuring a low likelihood that another algorithm (or designer) can build such a solution with only the original design specification as a starting point. Although extremely efficient, these techniques lack support for:

• Effective signature detection. Since the encoding of a digital signature is dependent upon the structure of the entire design specification, detecting an embedded signature requires unique identification of each component of the design [Kir98b]. Moreover, possible design alteration by the misappropriator may change the design only slightly, yet in a way that restoring the identifiers of design components requires detection of a number of subgraph isomorphisms [Kir98b]. Unfortunately, this problem is still listed as open in terms of its complexity [Gar79].

• Protection of design partitions. While the mentioned techniques are quite effective in protecting overall designs, they do not provide protection for design parti-
tions. Namely, in many designs (cores), their parts may have substantial and independent value (for example, a discrete cosine transform filter in an MPEG codec).

• Copied partition detection. Commonly, misappropriated designs or their parts are augmented into larger designs. The existing protection techniques cannot use the presence of only a part of the watermark in a design as a proof of authorship.

In this work, we introduce local watermarks, a generic IPP technique which provides the aforementioned protection requirements and can be applied to many combinatorial and continuous optimization problems. We have applied this IPP methodology to a subset of behavioral synthesis tasks: template matching and operation scheduling. Watermarking designs at these levels enables IP commerce of optimized behavioral specifications and RTL designs, which is exceptionally important for application-specific systems. It also protects behavioral synthesis tools and designs at levels of abstraction equal to or lower than behavioral synthesis. This property is becoming increasingly important because of the progress of reverse engineering technologies (e.g. Take Apart Everything Under The Sun Co. [Tae]) which enable precise, fast, and confidential retrieval of the netlist of a silicon product. As in the previous IPP techniques, in local watermarking, a watermark is encoded as a set of design constraints which does not exist in the original specification. The constraints are uniquely dependent upon the author's signature. Rather than embedding a single error-corrected watermark over the entire design, as in the previous techniques, in local watermarking a number of “small” watermarks are randomly augmented in the design. “Small” in the sense that the constraints of each watermark are placed in a smaller part (locality) of the design. Each
watermark exists and can be detected in its locality in the design, independently of the remainder of the design. Therefore, such watermarks enable protection for parts of the design because the copy detection algorithm does not need to see the entire design in order to decode the added constraints. Upon imposing the user-specific constraints on the original behavioral specification, a new input is generated which has the same functionality but contains user-specific information. The application of the synthesis algorithm to such an input results in a solution which satisfies both the original and the constrained design. Proof of authorship is based upon the fact that the likelihood that another application returns a solution to both the original and constrained input is exceptionally small. The added constraints may result in a synthesis trade-off: the more constraints, the stronger the proof of authorship, but the higher the overhead on solution quality.
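The locality-based embedding can be sketched as follows. The per-locality hashing scheme and all names are illustrative assumptions, not the exact protocol developed in this thesis; the point is simply that every locality carries its own small, independently checkable mark.

    # Illustrative sketch of local watermarking: one small, signature-derived
    # mark per locality, so a copied partition still carries a detectable mark.
    import hashlib
    import random

    def local_marks(signature, localities, marks_per_locality=2):
        marks = {}
        for name, objects in localities.items():
            seed = int.from_bytes(hashlib.sha256(f"{signature}:{name}".encode()).digest()[:8], "big")
            rng = random.Random(seed)
            marks[name] = [tuple(sorted(rng.sample(objects, 2)))
                           for _ in range(marks_per_locality)]
        return marks

    localities = {
        "dct_filter": [f"op{i}" for i in range(16)],        # e.g. a DCT block inside a codec
        "huffman":    [f"op{i}" for i in range(16, 32)],
    }
    # Each locality can be checked for its own marks without seeing the rest of the design.
    print(local_marks("author-signature", localities))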
1.3.3 Forensic Engineering of Solutions to Combinatorial Optimization Problems and an Application to Design Automation
The emergence of the Internet as the global communication paradigm has forced almost all semiconductor and VLSI CAD companies to market their intellectual property on-line. Currently, companies such as ARM Holdings [Arm], LSI Logic [Lsi], and MIPS [Mip] mainly constrain their on-line presence to sales and technical support. However, in the near future, it is expected that both core and synthesis tool developers will place their IP on-line in order to enable modern hardware and software licensing models. There is a wide consensus among the software giants (Microsoft, Oracle, Sun, etc.) that the rental of downloadable software will be their dominating business model in the new millennium [Mic]. It is expected that similar licensing models will become widely accepted among VLSI
CAD companies. Most of the CAD companies planning on-line IP services believe that copyright infringement will be the main negative consequence of IP exposure. This expectation has a strong background in an already “hot” arena of legal disputes in the industry. In the past couple of years, a number of copyright infringement lawsuits have been filed: Cadence vs. Avant! [EET99], Symantec vs. McAfee [IW99], Gambit vs. Silicon Valley Research [GCW99], and Verity vs. Lotus Development [IDG99]. In many cases, the concerns of the plaintiffs were related to the violation of patent rights, frequently accompanied by misappropriation of implemented software or hardware libraries. Needless to say, court rulings and secret settlements have impacted the market capitalization of these companies enormously. In many cases, proving legal obstruction has been a major obstacle in reaching a fair and convincing verdict [Mot, Afc]. In order to address this important issue, we propose a set of techniques for the forensic analysis of design solutions. Although the variety of copyright infringement scenarios is broad, we target a relatively generic case. The goal of our generic paradigm is to identify which one from a pool of synthesis tools has been used to generate a particular optimized design. More formally, given a solution SP to a particular optimization problem instance P and a finite set of algorithms A applicable to P, the goal is to identify, with a certain degree of confidence, the algorithm Ai that has been applied to P in order to obtain the solution SP. In such a scenario, forensic analysis is conducted based on the likelihood that a design solution, obtained by a particular algorithm, results in characteristic values for a predetermined set of solution properties. Solution analysis is performed in three steps: collection of statistical data, clustering of heuristic properties for each analyzed algorithm, and decision making with confidence quantification.
In order to demonstrate the generic forensic analysis platform, we propose a set of techniques for forensic analysis of solution instances for a set of problems commonly encountered in VLSI CAD: graph coloring and Boolean satisfiability. We have conducted a number of experiments on real-life and abstract benchmarks to show that, using our methodology, solutions produced by strategically different algorithms can be associated with their corresponding algorithms with relatively high accuracy.
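The three analysis steps can be sketched on a graph-coloring example as follows. The features (number of colors, variance of color-class sizes) and the nearest-centroid decision rule are illustrative assumptions; they stand in for the property sets and statistical tests developed later in this thesis.

    # Illustrative sketch of the three-step forensic flow: collect per-algorithm
    # statistics, form one cluster (centroid) per algorithm, and classify a new
    # solution with a crude confidence score.
    import math

    def centroid(samples):
        return [sum(xs) / len(xs) for xs in zip(*samples)]

    def classify(solution_features, training):
        """training: algorithm name -> list of feature vectors from known runs."""
        centroids = {alg: centroid(runs) for alg, runs in training.items()}
        weights = {alg: math.exp(-math.dist(solution_features, c))
                   for alg, c in centroids.items()}
        total = sum(weights.values())
        return {alg: w / total for alg, w in weights.items()}

    # Example features of a graph-coloring solution: (colors used, variance of
    # color-class sizes), gathered from runs of two hypothetical algorithms.
    training = {
        "greedy_dsatur": [(31, 4.1), (30, 3.8), (32, 4.4)],
        "tabu_search":   [(28, 1.2), (27, 1.0), (28, 1.5)],
    }
    print(classify((28, 1.3), training))   # highest confidence for "tabu_search"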
1.4 Thesis Organization
The remainder of this document is organized in the following way. In Chapter 2, we first survey the state-of-the-art in technologies and methodologies for debugging, engineering change, and intellectual property protection of designs and multimedia content. Next, in Chapter 3, we survey the existing design flows and how the methodologies developed in this work affect these flows. Finally, we present the constraint-based design methodologies in the following order: Chapter 4 discusses the new cut-based debugging platform for integrating simulation and emulation for application-specific cores, Chapter 5 extends the ideas presented in Chapter 4 to programmable systems-on-chip, Chapters 6 and 7 demonstrate two techniques for symbolic debugging: fast variable recovery and optimization-transparent design for symbolic debugging, respectively, Chapter 8 presents the novel engineering change methodology and its application to behavioral synthesis, Chapter 9 introduces a set of protocols for watermarking solutions to logic synthesis tasks, Chapter 10 discusses local watermarks as a means of enforcing copyright on design partitions, and finally, Chapter 11 introduces a statistical methodology for forensic analysis of solutions to combinatorial optimization problems.
CHAPTER 2
State-of-the-Art in Related Work

We survey the state-of-the-art in related work in the following research topics: emulation and simulation systems targeting system-on-chip debugging, symbolic debugging systems, engineering change techniques at all levels of hardware design abstraction, and intellectual property protection techniques in the electronics design automation industry. Many of these fields have recently become popular and have gained the attention of the research community both in academia and in industry. However, as outlined in the first chapter, the trends in the industry are such that the involved problems are of great importance even today, with huge potential to become major bottlenecks and product showstoppers in the near future.
2.1 Design Debugging
State-of-the-art tools for system debugging have concentrated on enhancing the performance of simulation models as well as design visibility of hardware emulation blocks, rather than trying to provide methods for their synergy. State-of-the-art RTL simulators are equipped with debuggers capable of performing error tracing and timing analysis [Syn] and simulation backtracking [Int]. Although instruction- and cycle-accurate programmable processor simulators provide full system visibility, their speed corresponds directly to the achieved accuracy and is often insufficient to debug complex systems [Ziv96, Ros95]. To partially over-
come these problems, a great deal of attention has been paid to providing kernels for mixed-model co-simulation of systems integrated using building blocks from different parties [Men].
2.1.1 System Simulation
Several system simulation platforms are currently widely deployed. The Mentor Graphics Seamless co-verification environment ties instruction-level processor simulators with VHDL-level design specification into a co-verification platform ideal for checking reset, boot sequence, memory re-mapping, and instruction fetch [Men]. Undetected errors in these functions result in a design being dead-on-prototype, a time-consuming and challenging problem to correct. Seamless' Coherent Memory Server and full functional CPU models together enable detailed execution of these critical design functions. Seamless can be used in synergy with a formal verifier without using hierarchical partitioning. The processor debugging engine is commonly integrated into the XRAY debugger which contains tightly integrated tools that accelerate the edit-compile-download-debug cycle. The performance of instruction-level simulators is commonly the bottleneck of such co-verification platforms. Compiled simulation engines introduced in [Ziv96] are capable of accelerating hardware-software co-verification by translating target code to compiled instead of interpreted native machine code. AXYS Design Automation builds processor models for Conexant and Infineon DSP processors. Synopsys offers a modeling-simulation platform capable of accepting designer input that ranges from static to completely dynamic [Syn]. The system evaluates statically scheduled design components at compile time to boost simulation performance. In addition, their clustering scheduler partitions static and
dynamic portions of the design and directs them to the appropriate simulation engine. During distributed simulation, the Synopsys RTL simulation analyzer Cyclone has the unique features of enabling the user to backtrack execution and to collect traces on demand. Combined with full system observability and controllability, backtracking can significantly accelerate the debugging process. An important feature of similar simulation environments is the ability to explore design specifications using both event-driven and cycle-based simulation methodologies, as well as to interface error tracing and timing analysis of faults [Int]. Fault simulation is another approach to performing the debugging process. For example, Cadence's fault simulator Verifault-XL uses capabilities such as distributed fault simulation, statistical methods for fault coverage of functional test vectors, and fault list management to create a highly productive test environment [Cad]. The simulator's unique ability to propagate faults through RTL code may be critical to verify networks of IP modules. Faults considered during the formal verification and generation of test vectors include blocking vs. non-blocking assignments, variable use before assignment, combinational loop detection, shift overflow, unequal operand lengths, and overlapping case expressions.
2.1.2 System Emulation
Hardware emulators have been developed as early as 1979 [Coc79], and have been under further development ever since [Pat95, Por85, Sam88]. Typical custom in-circuit emulator circuitry comprises capture logic, which monitors the contents of the program address register, the internal data bus, and control lines of the processor; trace circuitry comprising a FIFO buffer, which puts data from the capture logic to the output pins of the chip; and a content addressable memory and a software programmable logic array with emulation counters that together
function as a finite state machine, which performs the desired predetermined testing. FPGA-based emulation systems have been developed by a number of companies including Quickturn [Qui], Ikos [Iko], Aptix [Apt], and Axis [Axi]. The observability and controllability of variables in such systems is a great challenge for emulator developers. The developed approaches are inefficient, expensive, or both. A common approach uses the expensive, low-bandwidth, and intrusive JTAG boundary scan methodology [Mau86]. The most advanced application of JTAG circuitry has been introduced in the industry's first solution for run-time target application-host data exchange (RTDX) by Texas Instruments [Ti]. Software developers use C or DSP assembly code to address an internal data exchange library, which in turn makes use of a scan-based emulator to move data on and off chip via the JTAG serial test bus. Design controllability and observability can also be obtained by addressing user-customized SRAM memory cells (Quickturn Cobalt [Qui], Synopsys Eagle [Syn]) or by probing nets into the FPGA testbed (Quickturn System Realizer [Qui], Mentor Graphics SimExpress [Men], etc.). While the former case raises expenses, the latter reduces visibility (1024 signals can be traced during 4 million cycles in Mentor Graphics Celaro; 1152-6912 probes, each 128K deep, are available in System Realizer). Another approach to system debugging involves partitioning the system execution into simulation and emulation sub-systems (Quickturn Q/Bridge [Qui], Synopsys Eagle [Syn], Axis Corporation [Axi]). For example, Eagle uses emulation for the programmable components and simulation for the ASIC components. Novel challenges in system debugging are oriented towards verification of emulation hardware with respect to the targeted functionality and timing [Liu92], efficient tracing of a subset of signals from the emulator [Kui94], and signal re-
construction for increased visibility and reduced emulation bandwidth demands [Mar98, Iko]. Independently of the technology presented in this work, researchers at Axis Corp. have recently developed the most advanced product for combined emulation and simulation: Xcite [Axi]. Xcite is a Verilog simulator tightly coupled with a reconfigurable simulation co-processor (RCC). Xcite maps Verilog behavioral, RTL, or gate-level modules into the RCC computing elements to maximize parallel processing in the RCC engine. Xcite has the unique ability to swap between RCC and software simulation instantaneously. This allows users to simulate as fast as possible in RCC to the point of a design error, and then swap out the simulation state into the Xcite software simulator for design debug. Within the Xcite software simulation session, the user gets full node visibility and all Verilog language commands at her disposal. Such an approach enables simulation performance of 10k to 100k cycles/second. An important methodology for developing an effective combined simulation and emulation system is checkpointing. Checkpointing is used in fault-tolerant computing systems to prevent data loss and recomputation. In the domain of parallel systems, the usage of local data checkpoints has been explored for construction of consistent global checkpoints [Tsa98, Wan97]. Checkpoints have been used for synchronization of redundant task executions in fault-tolerant real-time systems [Ber95].
2.2 Symbolic Debugging
We survey the previous work related to symbolic debugging along three lines: CAD for debugging, transformations for behavioral synthesis, and symbolic de-
bugging of optimized code. In the CAD domain, Powley and De Groat recently developed a VHDL model for an embedded controller that supports debugging of the application software [Pow94]. Koch, Kebschull, and Rosenstiel [Koc95] proposed an approach for source-level debugging of behavioral VHDL in a way similar to software source-level debugging through the use of hardware emulation. Transformations alter the structure of a computation in such a way that the user-specified I/O relationship is maintained. They are widely used as an effective approach for improving implementation of computations in software and hardware development. For formal definitions of all related transformations, we refer to the standard compiler reference works [Aho77, Fis88]. More than a hundred primitive transformations have been tabulated [Aho77, Ban93, Bac94, Fis88, Par95]. Common subexpression elimination is regularly used in many compilers and discussed in great detail in the compiler literature [Fis88, Aho77]. Most often the treatment of common subexpressions in compiler research and development is based on value numbering [Coc70] and Sethi-Ullman numbering [Set70] techniques in a peephole optimization framework [McK65]. Skip-Add is a combination of common subexpression replication and associativity, where the first one acts as an enabling transformation for the second one [Mil88]. Associativity, distributivity, and commutativity are the three most often used algebraic transformations, commonly treated under the paradigm of tree-height reduction [Tri87, Har89, Lob91, Dun92]. When retiming is the only transformation of interest and the goal is the minimization of the critical path, several algorithms designed by Leiserson and Saxe provide an optimal solution in polynomial time [Lei91]. When the goal is minimal area or power, the problem has been proven to be NP-complete [Pot91]. Loop unfolding is one of those transformations which is regularly used in almost all optimizing compilers and many high-level synthesis systems [Rab91]. Since the early seventies, the treatment of linear recurrences has been widely stud-
ied in the compiler community, mainly when compilation to vector and highly pipelined computers is targeted [Pot92, Che75, Gaj81]. Another popular approach, particularly in logic synthesis [Bra84], is static ordering, where the order of the transformations is given a priori, most often in the form of a script. Hennessy has categorized and presented models to describe the effects of local and global optimizations on symbolic debugging of program variables [Hen82]. DOC [Cou88] and CXdb [Bro92] are two examples of debuggers for optimized code which do not deal with global optimizations. Adl-Tabatabai and Gross [Adl96] discussed the problem of retrieving values of source variables when applying global scalar optimizations. When the values of source variables are inaccessible or inconsistent, their approach merely detects this and reports it to the user. The research presented in this work is the first to explore enabling effective symbolic debugging with minimal impact on optimized design metrics. The developed techniques are applicable both to control- and data-intensive applications and therefore can be used in both hardware and software development.
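As a small illustration (not taken from the cited works) of why algebraic transformations such as tree-height reduction matter both to optimization and to symbolic debugging, the two expressions below compute the same sum, yet the balanced form shortens the critical path and no longer contains the intermediate values of the original chain:

    # Tree-height reduction via associativity: same result, shorter critical path.
    a, b, c, d = 1.0, 2.0, 3.0, 4.0
    y_chain    = ((a + b) + c) + d      # three dependent additions
    y_balanced = (a + b) + (c + d)      # two levels; the inner sums can run in parallel
    assert y_chain == y_balanced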
2.3 Engineering Change
One of the first systems that facilitates engineering change (EC) has been developed to interact with a formal verifier in order to construct a tight error detect-correct engine [Mad89]. A number of developed techniques for EC synthesize rectifying logic networks that, by using and altering the input and output to the existing network, implement the desired functionality [Wat91, Kha96]. Techniques for minimal alteration of an existing logic network focus on developing estimation-based iterative search techniques for minimal logic resynthesis [Swa97] or on reusing gates from the initial implementation and restricting synthesis only to the modified portions [Bra94]. Alternative wires, interconnects that can re-
place a target wire without changing the circuit's functionality, have been shown to aid post-layout logic restructuring [Cha97]. Fang et al. have developed an RT-level EC method which establishes data relationships between design stages and localizes the circuit affected by the change. Their Quick ECO system has been developed to support on-line debugging of FPGA-based logic emulators [Fan97]. Buch et al. have shown that a common EC technique such as rewiring can be used for power optimization of logic networks [Buc97].
2.4 Intellectual Property Protection
We survey the intellectual property protection (IPP) techniques along several directions: model packagers, design watermarking, and forensic analysis of designs. Model packagers are tools that encapsulate a design specification into a hard-to-reverse-engineer executable capable of simulating the design's functionality and interfacing with off-the-shelf RTL or gate-level simulation engines. Cadence's IP Model Packager enables secure distribution of HDL IP through a multi-language encapsulation methodology (both VHDL and Verilog are supported through the INCA compile process) [Cad]. As a result, the IP vendor controls IP distribution and licensing without mandatory licensing of any Cadence tools. Similar products have been developed by Escalade (IP Guard [Esc]), now acquired by Mentor Graphics, and Synopsys (IP Modeling [Syn]). There are a few drawbacks of model packagers: sensitivity to reverse engineering and the need to provide design viewports to enable second-party debugging. An alternative to model packaging is design watermarking. An invisible, hard-to-remove mark is embedded into a design such that the design copyright owner can frame a misappropriator in court. The main advantage of such a methodology
is non-restricted system visibility and public distribution. Recently, watermarking of artifacts has received a great deal of attention in the research community. Applications facilitating watermarking use the embedded data to track the usage of a particular artifact. Such tracking effectively and inexpensively enables pay-per-use applications and I-commerce of digital goods [Ben96, Ber96]. There is a wide consensus that IPP is the prime application of watermarking. A variety of techniques have been proposed for hiding data in still images [Bor96, Cox96]. Several techniques which exploit frequency and time imperfection of the human auditory system have been proposed for watermarking audio content [Ben96, Bon96, Cox96]. The AT&T research team developed a number of techniques for watermarking text documents [Ber96, Bra96]. The video-on-demand research resulted in a suite of approaches for watermarking video, mainly MPEG2 [Spa96, Har97]. As explicitly stated, none of these techniques can be applied to watermark active IP: designs or programs. Recently, researchers have endorsed steganography and code obfuscation techniques as viable strategies for design protection. Protocols for watermarking active IP have been developed at the physical layout [Cha99, Wol98], partitioning [Kah98], logic synthesis [Oli99, Kir98b], partial scan selection [Kir98d], and behavioral specification [Qu98, Hon98] levels. A routing-level approach for fingerprinting FPGA digital designs has been introduced [Lac98]. It applies encrypted marks to the design in order to support identification of the design origin and the original recipient. In the software domain, a good survey of techniques for copyright protection of programs has been presented by Collberg and Thomborson [Col99]. They have also developed a code obfuscation method which aims at hiding watermarks in a program's data structures. We trace the previous work related to forensic design analysis along the fol-
lowing lines: copyright enforcement policies and law practice, forensic analysis of software and documents, steganography, and code obfuscation. Software copyright enforcement has attracted a great deal of attention among law professionals. McGahn gives a good survey of the state-of-the-art methods used in court for detection of software copyright infringement [McG95]. In the same journal paper, McGahn introduces a new analytical method, based on Learned Hand's abstractions test, which allows courts to base their decisions on well-established and familiar principles of copyright law. Grover presents the details behind an example lawsuit [Gro98] in which Engineering Dynamics Inc. is the plaintiff in a judgment of copyright infringement against Structural Software Inc., a competitor who copied many of the input and output formats of Engineering Dynamics Inc. Forensic engineering has received little attention among technology researchers. To the best of our knowledge, to date, forensic techniques have been explored for detection of authentic Java bytecodes [Bak98] and to perform identity or partial copy detection for digital libraries [Bri95]. Recently, researchers have endorsed steganography and code obfuscation techniques [Col99] as viable strategies for content and design protection. Although steganography has demonstrated its potential to protect software and hardware implementations, its applicability to algorithm protection is still an unsolved issue. In order to provide a foundation for associating algorithms with their creations, in this work, for the first time, we present a set of techniques which aim at detecting copyright infringement by giving a quantitative and qualitative analysis of the algorithm-solution correspondence.
CHAPTER 3
Modern Design Flows

The complexity of modern application-specific systems has resulted in design flows which consist of a number of stages. Each of the stages addresses a particular level of design abstraction: system synthesis, high-level synthesis, logic synthesis, floorplanning, and placement and routing [DeM94, Sha96]. However, due to the involved design complexities, the shrinking time-to-market, and design re-use, the process of going from one stage to another has slowly evolved into two common design paradigms. The two most widely accepted design flows are the golden model and the waterfall model. Before we introduce these design flows, it is important to stress that, due to emerging deep sub-micron technologies [Syl99], future design flows are likely to become even more complex for several reasons: (i) reused designs (cores) will shrink and their integration will become a primary design challenge, (ii) debugging of core interaction will lead to frequent and possibly significant design changes that will propagate through all levels of design abstraction, and (iii) the inter-core bus networks will impose new design strategies focused on synchronization, reprogrammability, and fault-tolerance. Thus, studying current design flows and addressing various aspects of the near-future design flows is a task of immense importance for the design automation community. Let us now discuss the current design flow models and their individual components. The golden design flow model uses a copy of the design specification at
some level of abstraction (usually RTL) as a base design specification at which most of the changes are performed [Gat94]. Such a design flow ensures that iterative optimizations at higher design levels are always propagated throughout the design flow. Although the optimization process benefits from such a design flow, important design factors such as time-to-market and cost suffer. The underlying concept behind the waterfall design process is a progression through various levels of abstraction with the intent of fully characterizing each level before moving to the next level [Sha95]. The main advantage of this design flow is that it can be exceptionally effective for rapid prototyping; however, it requires careful global design decisions early in the design flow. Failure to make an effective system and high-level design specification commonly leads to unoptimized IC products. Thus, both design flow types provide advantages and disadvantages which one has to carefully consider when planning the product development process. Regardless of the design flow type, the progression steps included within the flow can be listed as follows.

• Hardware-software co-design. The starting point of the design process includes partitioning of the computation into modules which will be assigned programmable or dedicated hardware resources. The resources investigated are usually high-level components such as processors, memory hierarchies, buses, bridges, I/O and DMA controllers, and dedicated functional blocks (for example, Viterbi decoders, DCT, error correction codecs, compression routines, etc.). Selected architectures are initially modeled to a specific level of accuracy (instruction- and cycle-accurate, or bit- and word-accurate [Ziv94]), their individual and system performance are estimated using various behavioral simulation tools, and finally inter-module communication is also estimated using bus-simulation tools at various degrees of accuracy.
When parametrized programmable and static functional blocks are used, their block parameters are fully determined during hardware-software co-design. For parametrized processors [Ten], at this step, corresponding retargetable compilation tools need to be generated.

• Intellectual property selection and development. Upon determining the global system architecture, the designer commonly explores the pool of available intellectual property blocks, considers developing her own modules, and finally, in the case of external marketing of such IP blocks, needs to ensure proper intellectual property protection methods: watermarking or secure packaging (for example, Cadence Interleaved Native Compiled code Architecture [Cad]). An important factor in this process is also IP integration, as modules from different parties commonly have different interfaces. Fortunately, a few bus standards exist for IP integration, such as ARM's AMBA [Arm] and IBM's Open Bus Architecture [IBM]. IP selection and development also includes providing appropriate packaged simulation and verification tools that are used later on during the iterations of the design process.

• High-level synthesis.
The iterative design optimization process usually
starts with improving the high-level specification. High-level synthesis includes the following design tasks [DeM94]. Design of finite state machines (FSMs) for memory or control subsystems is an important design procedure, as in modern designs FSMs are becoming extremely complex, and their definitions and coding are error prone and hard to debug. Hardware allocation and resource sharing, coupled with scheduling and memory inference removal, are early performance determinants. Various behavioral optimizations such as application of common operation laws
(distributivity, associativity, and commutativity), pipelining, loop unfolding, and retiming are known to determine system performance at very early stages of the design entry [Rab91].

• Functional simulation. High-level descriptions are extensively tested before going on to the next level of design abstraction. Behavioral specs, defined in VHDL or Verilog, are run through simulation engines in search of errors or optimal hardware architectures. Methods such as mixed simulation (a VHDL simulator attached to an instruction-level simulator for system performance and correctness estimation and verification) are a common requirement for the system designer. Tool features such as reverse debugging (Synopsys Cyclone [Syn]) and timing analysis (Interra's debugger [Int]) are useful during the debugging process. It is important to stress that during the design flow iterations, frequent changes in the I/O specification, memory scheme, or datapath can require an entire spec rewrite in the case of the golden copy flow model or exceptionally slow progression through design levels in the waterfall design flow. As schedules shrink, meeting system performance and cost leaves limited time to evaluate multiple hardware implementations of algorithms. There is an extremely large number of possible architectures even for a moderately complex design. With schedule pressures, design teams have no time to explore hardware structure trade-offs because specifying and simulating the RTL for even a single architecture takes so long. Usually, in the golden model, once the design is simulated, the architecture is locked. Designers resist changing the source RTL because then they must resimulate the system with modifications.

• Logic synthesis. Synthesis tools typically make trade-offs between timing,
area, and power while selecting gates. Logic synthesis tools process the behavioral description of the system and create a gate-level netlist of the final design. This includes multi-level logic minimization, state assignment, technology binding, template mapping, etc. A good survey of the methodologies used during logic synthesis is given in [DeM98].

• Scan insertion and test-pattern generation. A chain of scan registers is augmented into the design specification, replacing ordinary flip-flops and latches, to enable effective sequential logic testing. Namely, Cheng and Agrawal have shown that the complexity of testing designs can grow exponentially with respect to the length of cycles in the directed graph of a synchronous sequential network [Che89]. To resolve this problem, they have proposed an approach where a subset of registers is interconnected into a chain that can be controlled and observed from the chip I/O ports (scan registers). The selection has to be such that all cycles in the design's directed network contain at least one scan register (a small sketch of this selection step is given at the end of this overview). Therefore, the directed network is made acyclic, which results in linear complexity of the testing algorithm for such a sequential (acyclic) graph. Efficient algorithms and test vector generation techniques that supplement this design-for-test methodology have been extensively studied [Che89, Chi93, Bha93, Nor96, Mak97a, Mak97b, Cha98].

• Design verification, simulation, and emulation. Commonly, designers spend two-thirds of development time and resources on this design step. As design complexities increase, this ratio is bound to increase even more. Methods used during this process include formal verification methods which verify block interfaces and their corner cases (FormalCheck by Cadence [Cad], Zero-in's checker [Zer]), functional and timing simulation, and emulation
or prototyping. The latter two are described in detail in Chapters 2, 4, and 5.

• Place and route. Once performed as a low-level routine, place and route is today becoming more integrated into all levels of design abstraction due to the importance of delay estimation in deep submicron [Syl99]. Thus, prior to logic synthesis, design planning and top-level routing are commonly performed. Then, unified synthesis and placement, floorplan-aware logic synthesis, and hard IP embodying are performed in sync such that the final process of gate-level routing is as effective as possible. The final steps are interconnect design and standard cell routing. Once the initial cell-level design is complete, the design process goes through a new iteration. Data generated by the physical synthesis is propagated as physical and logical constraints to higher levels of design abstraction. For example, timing constraints created during place and route are used to drive logic synthesis and even high-level synthesis. If the design contains complex datapath logic, as high performance designs often do, then special datapath synthesis is applied to take advantage of the inherent structure. Finally, large timing-critical blocks are targeted with unified RTL synthesis and placement, for single-pass timing closure.

In this work, we have developed a number of techniques which aim at improving the design process both functionality- and timing-wise. The developed cut-based design-for-debugging technique greatly facilitates the design verification process through a non-intrusive post-processing design methodology which augments the original design specification with constructs that enable (near-minimal) computation state transfer from an emulator to a simulator and vice versa. The constraint manipulation methodology for engineering change provides a highly
effective set of tools for propagating design changes through the design flow with minimal overall alterations. This results in reduced design efforts and faster verification. Finally, using another set of constraint manipulation techniques, we have developed an intellectual property protection methodology which augments statistically invisible marks onto an existing design with the goal of uniquely associating a given design with its copyright owner. In the remainder of this chapter, we describe in detail how each of these methodologies affects the design flow.
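The scan-insertion step described above requires choosing registers so that every cycle of the sequential network's directed graph contains at least one scan register. The sketch below illustrates that selection with a simple greedy cycle-breaking loop; it is only an illustration and is not one of the algorithms cited above.

    # Illustrative greedy scan-register selection: break every cycle of the
    # register dependency graph by placing at least one register from each
    # cycle into the scan chain.
    def find_cycle(graph, removed):
        """Return some cycle as a list of nodes, or None if none remains."""
        WHITE, GRAY, BLACK = 0, 1, 2
        color = {n: WHITE for n in graph if n not in removed}
        stack = []
        def dfs(u):
            color[u] = GRAY
            stack.append(u)
            for v in graph[u]:
                if v in removed:
                    continue
                if color[v] == GRAY:                 # back edge -> cycle found
                    return stack[stack.index(v):]
                if color[v] == WHITE:
                    cycle = dfs(v)
                    if cycle:
                        return cycle
            color[u] = BLACK
            stack.pop()
            return None
        for n in list(color):
            if color[n] == WHITE:
                cycle = dfs(n)
                if cycle:
                    return cycle
        return None

    def select_scan_registers(graph):
        scan = set()
        cycle = find_cycle(graph, scan)
        while cycle:
            scan.add(cycle[0])                       # any register on the cycle breaks it
            cycle = find_cycle(graph, scan)
        return scan

    # Registers r1..r4 with feedback paths r1->r2->r3->r1 and r3->r4->r3.
    print(select_scan_registers({"r1": ["r2"], "r2": ["r3"], "r3": ["r1", "r4"], "r4": ["r3"]}))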
3.1 Design-for-Debugging
The key idea behind the cut-based approach for the integration of simulation and emulation is to leverage the strong aspects of the two functional execution domains: simulation and emulation. The integrated debugging technique provides fast (emulation) as well as observable and controllable (simulation) functional execution. The essence of the new debugging paradigm lies in the concept of a complete cut of a computation. A complete cut of a computation represents a subset of variables sufficient to correctly continue the computation regardless of the values of variables that are not part of the cut, i.e. all variables in the computation that are not part of the cut can be recomputed using only the cut variables. The full system state is a straightforward example of a complete cut. However, a complete cut may be substantially smaller than the system state. Clearly, if one has complete controllability and observability over the state of all variables in the complete cut for a specific breakpoint, the computation can be continued functionally correctly from that breakpoint. A cut contains the complete information about the history of the computation process and its primary inputs until a given point in time (breakpoint). For the sake of brevity, from now on when
we say cut, we mean a complete cut. The typical functional debugging process includes four standard debugging steps: functional test generation, error detection, error diagnosis, and error correction. The test input vectors are used to drive the design emulation. Error detection techniques are deployed to signal whether the emulation should be terminated due to an observed error. They can be as trivial as a check for equality of a particular variable and as complex as deployment of sophisticated error detection schemes for communication systems [6]. Upon detection, the fault is localized and characterized. This is done by simulating the design on a workstation starting from the set of cuts that precede a designer-selected checkpoint. The starting checkpoint should appear prior to the error occurrence. The functional error is diagnosed using a synergy of design simulation and emulation. Upon error diagnosis, the error is corrected by updating the hardware and/or software or by masking the error using hardware/software workarounds. Our cut-based functional debugging approach is conducted using the following three phases of the design and debugging process:

• Design post-processing.

• Phase 1 - Defining the cut from the RT-level design specification.

• Phase 2 - Augmentation of the design specification with cut statements which support controllability and observability when the design is executed in a debug mode; and

• Debugging process.

• Phase 3 - Simultaneous and coordinated design execution of the fabricated or emulated design and the appropriate simulator for efficient
debugging.

In the first phase, a computation iteration at the behavioral level of specification is logically partitioned into two or more components such that the cut between the partitions is complete. The synthesis support for exchange of information between simulation and emulation has the following three degrees of design freedom.

• The determination of the variables which form the cut;

• The determination of the exact control step when the state of a particular variable is read or replaced by a user-specified state; and

• The assignment of specific sets of I/O pins used to transfer variable states to or from the chip.

It is important to notice the optimization trade-off involved in finding the optimal cut. The optimization procedure has a unique goal: to add minimum hardware resources into the initial design, while obtaining its full controllability and observability. A cut with a minimum number of variables seems to be an attractive solution. However, if those variables are simultaneously alive, more registers that hold the variables of the cut-set have to be provided with register-to-I/O-pin interconnects. Therefore, a favorable cut is one which consists of variables with long disjoint life-times and has the property that a small number of machine registers contain the cut variables. Once an optimal cut is found, the next design problem is to define the sequence of control steps in which the variables are dispensed out of the chip. The freedom of transferring the cut state is limited due to the control steps when the I/O pins are busy. Due to the lack of idle cycles, in some cases, not all variables of the cut can
be transferred over the I/O pins of the emulator. One straightforward solution to this problem is to allocate a near-minimal buffer to hold the unscheduled variables. Potkonjak, Dey, and Wakabayashi have utilized the concepts of pipelining debugging variables, for improving their scheduling and assignment freedom, and of I/O buffers, for improving the resource utilization of I/O pins [Pot95]. They do not search for a complete cut of a computation. Instead, they derive provably optimal bounds for the maximum cardinality of the set of controllable and observable variables for a given design specification. Most importantly, they have developed a non-greedy heuristic minimization algorithm for I/O buffer allocation, which can be successfully used in the cut-based debugging framework. In the second phase of the synthesis approach, the original design specification is augmented with additional resources that enable design observability and controllability. For example, the following input operation is incorporated to provide complete controllability of variable Var1 using the user-specified input variable Input1: if (DEBUG) then Var1 = Input1; In the case of pipelined functional units, their pipeline latches are not subject to inclusion into cuts. However, for programmable platforms with states inaccessible by instructions (pipeline, branch predictors) and memory hierarchies, the problem of outputting the machine cut state becomes a lengthy process and is addressed in Chapters 4 and 5. The platform presented there encapsulates a complex environment with a set of programmable cores, a number of ASIC accelerators, and a memory hierarchy.
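The completeness requirement on a cut can be checked mechanically. The sketch below is an illustrative check, not the selection algorithm developed in Chapters 4 and 5: it verifies that every operation executed after the breakpoint reads only primary inputs, cut variables, or values recomputed after the breakpoint itself.

    # Illustrative completeness check for a candidate cut at a given breakpoint.
    def is_complete_cut(ops_after_breakpoint, cut, primary_inputs):
        """ops_after_breakpoint: list of (output_var, [input_vars]) in execution order."""
        available = set(cut) | set(primary_inputs)
        for out, ins in ops_after_breakpoint:
            if not all(v in available for v in ins):
                return False                 # an operand is neither input, cut, nor recomputed
            available.add(out)
        return True

    # IIR-like iteration: the delayed states d1, d2 form a complete cut,
    # while {d1} alone does not.
    ops = [("d0", ["x", "d1", "d2"]), ("y", ["d0", "d1", "d2"])]
    print(is_complete_cut(ops, cut=["d1", "d2"], primary_inputs=["x"]))   # True
    print(is_complete_cut(ops, cut=["d1"], primary_inputs=["x"]))         # False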
3.1.1 Design-for-Debugging for Programmable and Statically-Scheduled Computation Platforms
This subsection describes the key steps of the design-for-debugging approach when applied to a system-on-chip platform that consists of both programmable
and application-specific circuits. Figure 3.1 illustrates the technical details of the process of cut I/O from such a system. The instrumentation of the code, which runs on the MPC, starts the cut export process. For example, as shown in Figure 3.1, it first sends a signal (start ASIC) to the ASICs that starts the cut output sequence of all ASICs. The process of their cut I/O is statically scheduled. Cuts of cores ASIC1 (Cut1) and ASIC3 (Cut2) are interleaved and the cut of core ASIC4 (Cut3) is output after the cuts of cores ASIC1 and ASIC3. Since the cut I/O of the ASICs is statically scheduled, the MPC knows when the ASIC cut export is complete. At that point, the MPC polls (start SPC(i)) the cut I/O control of each SPC in the system. Upon receipt of this signal, the virtual tristate gate that controls the actual I/O of cut variables onto the shared bus is enabled. The instrumented code running on the SPC has to be able to assure
that exactly one cut I/O (Cut SPC) is completed. Once its cut is dispensed, the SPC sends a signal back to the MPC that acknowledges one successful cut I/O process. At that point the MPC initiates its own cut I/O, which represents the end of the cut I/O process.

Figure 3.1: Cut-based debugging: an exemplary process of outputting cut variables of all cores (both programmable and application-specific) in the system through a common bus structure.
3.1.1.1 ASIC Design-for-Debugging
During the design of an application-specific core, debug functionality is added as a postprocessing step. This functionality includes a set of register-to-output interconnects that enables the export of a number of different cuts and a hardware feature that enables the system integrator to select a specific cut. The motivation of enabling multiple cuts for I/O is as follows: (i) the designer often does not know the system integration constraints during the design of an individual component and does not want to modify the design upon request at integration time and (ii) the system integrator can benefit from multiple cuts if hard-to-solve scheduling instances are encountered. The I/O of variables of a particular computation cut is enabled by explicit connection of registers that store these variables to the I/O ports of the ASIC (if these registers are not already connected). On the other hand, note that one subset of registers may be used for I/O of a number of different cuts. This property of the register selection for cut I/O can be used to enable selection of the particular cut to be output at integration time. The goal is to achieve more flexible cut I/O in the cases when multiple cores are outputting their cuts on the same bus or the developed core is used as a subblock in a larger core. The integrator controls the selection of a particular cut using, for example, different control microcode.
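The cut-selection freedom left to the integrator can be illustrated with a small sketch. The cost model (count of new register-to-pin interconnects) and all names are illustrative assumptions; the thesis couples this choice with the bus scheduling discussed in the integration step below.

    # Illustrative cut selection: each ASIC offers several exportable cuts (register
    # sets); pick the one that needs the fewest new register-to-I/O interconnects.
    def pick_cut(offered_cuts, already_wired):
        """offered_cuts: cut name -> set of registers; returns (name, new wires needed)."""
        cost = {name: len(regs - already_wired) for name, regs in offered_cuts.items()}
        best = min(cost, key=cost.get)
        return best, cost[best]

    offered = {
        "cut_A": {"r1", "r2", "r4"},
        "cut_B": {"r2", "r5"},          # reuses registers already wired for other purposes
    }
    print(pick_cut(offered, already_wired={"r2", "r5"}))   # ('cut_B', 0)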
3.1.1.2 Code Compilation-for-Debugging
In general, each programmable core has two components in its cut: instruction-accessible states (e.g. general-purpose registers), and states not accessible using machine code (e.g. branch prediction hardware, caches, and pipeline latches). The part of the cut accessible to instructions is transferred using debug instructions that are instrumented into the original code. The portion of the cut that is not accessible by instructions can be exported in several ways. Many state-of-the-art processors, such as Texas Instruments' line of DSP processors [Ti], provide built-in debug ports that control breakpoint, pipeline and general purpose registers, and memory access logic. Alternatively, flushing of caches, pipelines, and/or branch predictors can be used as a means of state invalidation. However, this approach may decrease the performance of the debugging system. Such a deficiency may be unacceptable for debugging real-time systems. An alternative approach is to shadow the invisible states in such a way that at the moment of cut I/O the programmable core continues processing using one copy, while the shadowed copy is transferred to the monitoring workstation. Upon transfer, the shadowed copy is updated with the latest changes. The debug instructions can be added either before or after compilation. While the pre-compilation choice requires in-source-code embedding of cut I/O instructions (for example, used in the MIPS Pixie [Smi91]), the post-processing step encompasses object code instrumentation similar to that implemented in Purify (Purify uses this technique to locate memory access errors) [Has92]. The pre-compilation approach has a significant advantage due to its independence with respect to the hardware platform. Debugging platforms that can provide real-time JTAG support for such an approach have already been developed and marketed [Mau86].
Figure 3.2: A generic system architecture for the developed debugging platform. It consists of individual cores, the embedded software running on these cores, an inter-core bus network, and a set of protocols for core intercommunication. The programmable core [IP block] in the figure runs the instrumented code:

    IN X(N)
    D(N) = X(N) + D(N-1)A1 + D(N-2)A2
    OUT D(N)
    if DEBUG D(N) = IN
    Y(N) = D(N)B0 + D(N-1)B1 + D(N-2)B2
    OUT Y(N)

An example of instrumented code is given in Figure 3.2, where the programmable core executes a 2nd order direct-form IIR filter. The instruction OUT(D[N]); is used for observability and if DEBUG D[N] := IN; for controllability. Controllability in emulation is beneficial because of the following application. The designer can easily change the state of the computation in simulation. Next, as a part of the debugging process, using emulation controllability, the designer can transfer the computation state from simulation to emulation, and hence restart the computation in emulation with an arbitrary computation state as a starting point. The instrumentation process is performed in four phases. In the first phase, the minimal-size cut for each statically scheduled user-defined SDF computation
island running on each programmable core is identified. An SDF computation island in a SISRAM computation [Aho83] is a set of instructions that process input data following the SDF computation model. The SDF computation islands are triggered by interrupts and executed pseudo-periodically according to the input data rate. In the second phase, the code is augmented with debug instructions that perform cut I/O. In the third phase, we identify the cut variables outside the SDF islands. Finally, we instrument the code with instructions that initiate and call the function performing the system state I/O. Usually, the call to this function is placed in the main loop of the program. Using profiling tools, the designer determines the best location and calling frequency of this function. The fourth phase of the instrumentation for debugging is currently not automated.
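A minimal sketch of the first two instrumentation phases is given below: an SDF island is taken as a straight-line block of statements, a previously selected cut is given, and the island is emitted with cut I/O debug statements in the style of Figure 3.2. The statement format and helper name are illustrative assumptions.

    # Illustrative instrumentation of an SDF island with cut I/O debug statements.
    def instrument_island(statements, cut_vars):
        out = []
        for stmt in statements:
            out.append(stmt)
            target = stmt.split("=")[0].strip()       # variable defined by this statement
            if target in cut_vars:
                out.append(f"OUT {target}")            # observability
                out.append(f"if DEBUG {target} = IN")  # controllability
        return out

    island = [
        "D(N) = X(N) + D(N-1)A1 + D(N-2)A2",
        "Y(N) = D(N)B0 + D(N-1)B1 + D(N-2)B2",
    ]
    for line in instrument_island(island, cut_vars={"D(N)"}):
        print(line)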
3.1.1.3 Integration-for-Debugging
The ASIC developer provides the system integrator with information about the set of cuts that can be enabled. For each ASIC, the variables and the control steps at which they can be dispensed through the virtual pins of the ASIC are given. The system integrator faces three design problems. First, for each ASIC a single cut has to be selected. Second, the selected cuts, jointly with the primary inputs and outputs, are scheduled for I/O over the available set of pins. We integrated these two phases into a tight optimization loop, which searches for a feasible schedule. Finally, if no schedule is found, the ASIC cuts are transferred sequentially in such a way that no two scheduled cut variables are displayed on the system bus in the same control step.
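The feasibility side of the scheduling problem can be sketched as follows. The first-fit assignment of cut variables to free (control step, pin) slots is an illustrative stand-in for the optimization loop described above, not the actual algorithm; if no assignment exists, the fallback is the fully sequential transfer.

    # Illustrative check: assign cut variables to free (control step, pin) slots
    # on the shared bus; return None when the instance is infeasible.
    def schedule_cut_io(cut_vars, n_steps, n_pins, busy):
        """busy: set of (step, pin) slots already taken by primary inputs/outputs."""
        free = [(s, p) for s in range(n_steps) for p in range(n_pins)
                if (s, p) not in busy]
        if len(free) < len(cut_vars):
            return None                      # infeasible -> sequential fallback
        return dict(zip(cut_vars, free))     # first-fit assignment

    busy = {(0, 0), (1, 0)}                  # pins occupied by primary I/O
    print(schedule_cut_io(["a1", "a2", "s1"], n_steps=3, n_pins=2, busy=busy))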
3.2
Pre-processing for Symbolic Debugging
The global flow of the symbolic debugging process is depicted in Figure 3.3. As a compilation pre-processing step, the developed Design-for-Debugging (DfD) technique analyzes the original behavioral specification CDFG in order to select a complete golden cut GC that is optimization-friendly. Upon selection, the DfD procedure augments the original specification with statements that enforce computation of the golden cut variables. If the DfD approach is part of an optimizing compiler, this step can be performed by marking variables. An independent, modular DfD technique would achieve the same goal by specifying the golden cut variables as output variables. Once computation of the golden cut variables is assured, the modified behavioral specification CDFG_m is processed by a synthesis tool. The result of this process is an optimized behavioral specification CDFG_o with guaranteed existence of the golden cut variables. While monitoring code execution, the symbolic debugger scans for values of golden cut variables and stores them in designated buffers. Since the computation of a single source variable may involve values of golden cut variables from several iterations, the depth of each buffer can be larger than one. The expectation is that the set of cut variables is much smaller than the set of variables in the source CDFG (see Chapters 4 and 5); therefore, the memory overhead for golden cut maintenance is in general low. While debugging, at a specific breakpoint the user inquires about a source variable v_i in the source CDFG. Initially, the symbolic debugger determines whether v_i exists in the optimized CDFG_o. This step can be performed efficiently by keeping a list of variables that exist in both the source and optimized CDFGs. If the variable does not exist in the optimized code CDFG_o, then its value is computed from the golden cut. All the variables in the cut that the variable v_i depends on are determined by a breadth-first search of the source CDFG with reversed arcs. Finally, we compute variable v_i using the cut values and the statements from the original CDFG.
[Figure 3.3 flowchart: Design-for-Debugging (search for a complete golden cut GC in the CDFG, specification augmentation yielding CDFG_m, synthesis yielding CDFG_o), followed by Symbolic Debugging (query "Variable X = ?"; if X exists in CDFG_o its value is printed directly, otherwise a depth-first search determines the subset S of GC on which X depends, X is computed from S by performing the operations of the CDFG, and its value is printed).]
Figure 3.3: Global design flow for the developed design-for-symbolic-debugging methodology.
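To make the recomputation step concrete, the following sketch (hypothetical data structures and function names, not the dissertation's debugger code) traverses the source CDFG with reversed arcs, breadth-first, to find the golden-cut subset S that the queried variable depends on, and then re-executes the original statements to recover its value.

from collections import deque
import operator

def recompute_from_golden_cut(cdfg, golden_cut_values, query):
    """cdfg: dict variable -> (operation, [operand variables]) from the source CDFG.
    golden_cut_values: buffered values of golden cut variables at the breakpoint."""
    # Backward BFS (reversed arcs) from the query; it stops at golden-cut variables
    # and collects the intermediate variables that must be recomputed.
    cut_subset, to_recompute, frontier = set(), set(), deque([query])
    while frontier:
        v = frontier.popleft()
        if v in golden_cut_values:
            cut_subset.add(v)                # part of the subset S of GC that the query needs
        elif v not in to_recompute:
            to_recompute.add(v)
            frontier.extend(cdfg[v][1])      # predecessors in the source CDFG

    # Forward re-execution of the original statements, starting from the cut values.
    values = dict(golden_cut_values)
    def evaluate(v):
        if v not in values:
            op, operands = cdfg[v]
            values[v] = op(*[evaluate(u) for u in operands])
        return values[v]
    return evaluate(query), cut_subset

# Hypothetical query: y = (a + b) * c, where {a, b, c} form the golden cut.
cdfg = {"t": (operator.add, ["a", "b"]), "y": (operator.mul, ["t", "c"])}
print(recompute_from_golden_cut(cdfg, {"a": 1, "b": 2, "c": 4}, "y"))
# value 12 together with the cut subset {'a', 'b', 'c'}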
3.3
Pre- and Post-processing for Engineering Change
As the complexity of behavioral specifications increases, both design flows become more exposed to the engineering change (EC) process, because designs must be updated throughout many stages. To address this issue, we have developed a generic EC methodology, applicable to all design stages, which uses constraint manipulation to augment the design with flexibility
for future changes. The EC is conducted by searching for a correction that induces minimal hassle (perturbation) of the optimized solution. Flexibility for EC is achieved in a synthesis pre-processing step, as shown in Figure 3.4. The initial behavioral design description BD is augmented with additional design constraints (BD_a). The additional constraints reflect the demand for flexibility. For example, in register allocation, i.e., graph coloring, in order to impose that two variables which may be stored in the same register are assigned to different ones, an edge has to be added between these variables in a pre-processing step to graph coloring. The application of the optimization algorithm to BD_a provides a solution OptD_a that can satisfy both the original and the EC-targeted constraints. The additional design constraints can be focused towards a particular type of error or augmented to provide guaranteed flexibility for EC after an arbitrary error is diagnosed. The trade-off between significant design flexibility and a small hardware overhead can be tuned according to the designer's needs. The error correction post-processing is performed on the augmented design specification BD_a with the goal of altering as few design components as possible while creating an optimized design with the given functionality, cOptD_a. The error correction process is conducted iteratively in a loop with three steps. In the first step, the correction process is restricted to a partition OBD_a ⊆ BD_a; OBD_a contains the set of corrections and its closest neighborhood. The optimization process is applied only to this portion of the design, while the optimization solution for the remainder of the graph is left intact. In the second step, the constraints of the remainder of the design, rBD_a = BD_a − OBD_a, are manipulated. Although the manipulated part of the design, rcBD_a, presents a problem of smaller cardinality, its constraints have the same impact on OBD_a.
[Figure 3.4 contents. Pre-processing for EC: the behavioral design spec (BD) is augmented with additional constraints by the EC pre-processing algorithm to obtain the additionally constrained design spec (BD_a), which an off-the-shelf synthesis tool optimizes into OptD_a. Post-processing for EC: after alterations, design partitioning splits BD_a into a spec part to leave intact (rBD_a) and a spec part to update for EC (OBD_a); constraint manipulation turns rBD_a into rcBD_a, which is merged with OBD_a (MBD_a) and synthesized with the off-the-shelf tool; the optimization subsolutions to rBD_a and OBD_a are merged, and a (binary) search for minimal hassle repeats the loop until it ends, yielding the corrected optimized spec (cOptD_a).]
Figure 3.4: The design flows for two developed engineering change methodologies: design for engineering change and post-processing for engineering change.
The constraint manipulation algorithm is heavily dependent upon the actual optimization problem; details of several such algorithms are presented in Chapter 5. In the last step, the off-the-shelf optimization algorithm is applied to the merger of parts MBD_a = rcBD_a ∪ OBD_a. The portion of the solution to this problem that corresponds to OBD_a then replaces the corresponding portion of the initial optimized solution OptD_a, resulting in a corrected optimized solution cOptD_a. The increased flexibility for EC built into the initial design specification BD_a enables a more efficient search for the update that satisfies the correction. The described loop is repeated in a search for the smallest subdomain OBD_a of the original specification within which the error correction is performed.
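As a simplified, concrete instance of this loop for register allocation (graph coloring), the sketch below is only an illustration under several assumptions: OBD_a is approximated by the changed vertices plus their one-hop neighborhood, the colors of the remaining vertices act as the manipulated constraints of rcBD_a, and a greedy coloring plays the role of the off-the-shelf synthesis tool; the binary search for minimal hassle is omitted. All names are hypothetical.

def local_recolor(interference, coloring, changed_vertices):
    """interference: dict vertex -> set of neighbors; coloring: current register assignment.
    Re-colors only the neighborhood of the change (OBD_a); the rest of OptD_a stays intact."""
    # Step 1: restrict the correction to OBD_a - the changed vertices and their neighborhood.
    region = set(changed_vertices) | {n for v in changed_vertices for n in interference[v]}

    # Step 2: constraint manipulation on the remainder rBD_a - its vertices keep their
    # colors, so they act as fixed (forbidden-color) constraints on the region.
    corrected = {v: c for v, c in coloring.items() if v not in region}

    # Step 3: apply the "off-the-shelf" optimizer (greedy, largest degree first) to the
    # merged subproblem and splice the subsolution back into the overall solution.
    for v in sorted(region, key=lambda u: -len(interference[u])):
        forbidden = {corrected[n] for n in interference[v] if n in corrected}
        corrected[v] = next(c for c in range(len(interference) + 1) if c not in forbidden)
    return corrected

# Hypothetical engineering change: variables a and b must now reside in different registers,
# which adds the interference edge a-b to an already colored graph.
graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(local_recolor(graph, {"a": 0, "b": 0, "c": 1, "d": 0}, {"a", "b"}))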
3.3.1
Constraint-based Intellectual Property Protection
We developed an IPP approach which enables the designer or tool developer to embed a signature into the optimized design during the execution of several combinational logic synthesis tasks. The key idea is to embed additional information into the initial specification of the design in such a way that, after one of the combinational synthesis steps is applied, we obtain both a functionally correct design and a proof that the design was produced by that designer and/or tool. The synthesis flow which employs watermarking of combinational logic synthesis solutions encompasses several phases, illustrated in Figure 3.5. The first three phases in the watermarking approach are the same for both multi-level logic minimization and technology mapping. In the first step, the gates in the initial logic network specification are sorted using an industry-specified standard. As a result of this procedure, each gate is assigned a unique identifier. Next, the gate ordering is permuted in a way specific to the designer's or tool developer's signature. For this purpose, we use a keyed RC4 one-way function to generate pseudo-random bits [Men97] which guide the process of iterative gate selection.
[Figure 3.5 contents: the original design specification (netlist) is processed by assigning a unique ID to each gate (following an EDA standard for IP protection); a keyed one-way pseudo-random node permutation, driven by the author's ID and secret key, produces a signature-driven ordering of the nodes; signature-specific constraints are then added to the design by enforcing the first K nodes of this ordering to appear as (pseudo-)primary outputs in the final solution, and the existence of these nodes in the solution constitutes the watermark; the synthesis automation tool (multi-level logic optimization or technology mapping) is applied to the additionally constrained specification, producing the watermarked optimized design (netlist).]
Figure 3.5: The protocol for hiding information in solutions for multi-level logic optimization and technology mapping.

In the next phase, the outputs of the first K gates in the pseudo-randomly permuted ordering are selected for explicit assignment to primary outputs. In the case of technology mapping, this phase represents the final phase of the watermarking protocol. If multi-level logic minimization is performed, the generated pseudo-primary outputs are used as inputs into an additional logic network which is embedded into the initial design specification. This network is created according to the author's signature. After additionally constraining the initial design
specification, the optimization algorithms are applied to the constrained logic network. The result retrieved by the synthesis algorithm satisfies both the initial and the constrained design specification. The strength of the proof of authorship depends upon the likelihood that some other algorithm, when applied to the initial design specification, retrieves a solution which also satisfies the constrained input.
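A minimal sketch of the signature-driven constraint-generation step follows (hypothetical names; Python's SHA-seeded generator stands in for the keyed RC4 stream used in the dissertation): gates carry canonical IDs, the ordering is permuted pseudo-randomly from the author's ID and secret key, and the outputs of the first K gates become the enforced pseudo-primary outputs whose presence in the optimized netlist constitutes the watermark. Verification regenerates the same permutation from the secret key and checks that those outputs survived synthesis.

import hashlib
import random

def watermark_outputs(gate_ids, author_id, secret_key, k):
    """gate_ids: gate identifiers in the canonical (industry-standard) ordering.
    Returns the outputs of the first K pseudo-randomly permuted gates."""
    # Keyed, signature-specific seed; stands in for the keyed RC4 one-way function.
    seed = hashlib.sha256(f"{author_id}:{secret_key}".encode()).digest()
    rng = random.Random(seed)

    permuted = list(gate_ids)
    rng.shuffle(permuted)            # signature-driven node permutation
    return permuted[:k]              # enforced as (pseudo-)primary outputs

# Hypothetical eight-gate netlist; three outputs are enforced as the watermark.
marked = watermark_outputs([f"g{i}" for i in range(8)], "author-42", "s3cret", 3)
print(marked)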
CHAPTER 4

Improving the Observability and Controllability of Datapaths for Emulation-based Debugging

Growing design complexity has made functional debugging of ASICs crucial to their development. Two widely used debugging techniques are simulation and emulation. Design simulation provides good controllability and observability of the variables in a design, but is two to ten orders of magnitude slower than the fabricated design. Design emulation and fabrication provide high execution speed, but significantly restrict design observability and controllability. To facilitate debugging, and in particular error diagnosis, we introduce a novel cut-based functional debugging paradigm that leverages the advantages of both emulation and simulation. The approach enables the user to run long test sequences in emulation, and upon error detection, roll back to an arbitrary instance in execution time and transparently switch over to simulation-based debugging for full design visibility and controllability. The new debugging approach introduces several optimization problems. We formulate the optimization tasks, establish their complexity, and develop most-constrained least-constraining heuristics to solve them. The effectiveness of the new approach and accompanying algorithms is demonstrated on a set of benchmark designs where combined emulation and simulation is enabled with low hardware overhead.
4.1
Introduction
The key technological and application trends, mainly related to increasingly reduced design observability and controllability, indicate that the cost and time expense of debugging follow sharply ascending trajectories. The two most directly related factors are the rapid growth in the number of transistors per pin and the increased level of hardware sharing. The analysis of physical data for state-of-the-art microprocessors (according to The Microprocessor Report) indicates that in less than two years (from late 1994 to mid 1996) the number of transistors per pin increased by more than a factor of two, from slightly more than 7,000 to 14,100 transistors per pin. At the same time, the size of an average embedded or DSP application has been approximately doubling each year, the time to market has been getting shorter with each new product generation, and there has been a strong market need for user customization of application-specific systems. Together, these factors have resulted in shorter available debugging time for increasingly complex designs. Finally, design and CAD trends that additionally emphasize the importance of debugging include design reuse, the introduction of a system software layer, and the increased importance of collaborative design. These factors result in increasingly intricate functional errors, often due to the interaction of parts of designs written by several designers. Such technology and design trends indicate that functional verification is emerging as the dominant step with respect to time and cost in the development process. The difficulty of verifying designs is likely to worsen in the future. The Intel development strategy team foresees that a major design concern for their year-2006 microprocessor will be the need to exhaustively test all possible computational and compatibility combinations [Yu96]. Traditional approaches, such as design emulation and simulation, are becoming increasingly inefficient at addressing system
debugging needs. Design emulation - implemented on arrays of rapid-prototyping modules (FPGAs) or on specialized hardware - is fast, but due to strict pin limitations provides limited and cumbersome design controllability and observability. Simulation - a software model of the design at an arbitrary level of accuracy - has the required controllability and observability, but is, depending on the modeling accuracy, two to ten orders of magnitude slower than emulation [Ziv96, Ros95].

[Figure 4.1 diagram: the functional debugging loop (test vector generation and execution, error detection, error diagnosis, error correction), with emulation providing fast functional execution, simulation providing observability and controllability, and the cut connecting the two domains.]
Figure 4.1: The new concept of functional debugging. The running design periodically outputs the cut state, which is stored in a database. Any one of these states can be used to initialize, and then continue, execution with preserved functional and timing accuracy.

The novel ideas proposed in this work advocate the development of a new paradigm for debugging and design-for-debugging of ASICs. The new debugging technique integrates design emulation and simulation in such a way that the advantages of the two are combined while the disadvantages are eliminated.
The functional debugging process, depicted in Figure 4.1, includes four standard debugging procedures: test input generation and execution, error detection, error diagnosis, and error correction. Long test sequences are run in emulation. Upon error detection, the computation is migrated to the simulation tool for full design visibility and controllability. To explain how execution is transferred from one execution domain to another, we introduce the notion of a complete cut. A complete cut is a subset of variables that fully determines the design state at an arbitrary time instance. The ability to read/write the state of a particular cut from/to the design is enabled by inserting register-to-port interconnects and appropriate scheduling statements into the initial design specification. The design techniques developed to enable migration of the execution are applied as a design post-processing step and, thus, can be used in conjunction with existing or future synthesis systems or manual design approaches. The running design (simulation or emulation) periodically outputs the cut state. These states are saved by a monitoring workstation. When a transition to the alternate domain is desired, any one of the previously saved states can be used to initialize, and then continue execution in simulation or emulation with preserved functional and timing accuracy. Once the error is localized and characterized in the error diagnosis step, the emulator is updated or built-in fault tolerance mechanisms are activated. The new debugging approach introduces a number of optimization problems involved in the design-for-debugging post-processing phase. The developed set of optimization methods aims to add minimum hardware overhead and still provide efficient integration of the two functional testing domains. The applied algorithms are constructed using the most-constrained least-constraining heuristic methodology. The efficiency of the developed algorithms is tested on a set of real-life
examples, where combined simulation and emulation debugging is provided with exceptionally low implementation overhead.

[Figure 4.2 contents: (a) the control data flow graph of the fifth order continued fraction IIR filter, composed of multiplications C8–C19, additions A1–A10, the primary input IN, the primary output OUT, and state delays D1–D5; (b) the same computation assigned, allocated, and scheduled on one multiplier, one adder, and registers R0–R5 over control steps 0–12, with cut C1 drawn with dotted edges and cut C2 with bold edges.]

Figure 4.2: Optimal cut example. (a) CDFG and (b) allocated, assigned, and scheduled CDFG for the fifth order CF IIR filter. Subfigure (b) depicts two cuts: C1 = {IN, D1, D2, D3, D4, D5} with dotted edges and C2 = {IN, A2, A4, A6, A8, A10} with bold edges.
4.1.1
Motivational Example
The diagnosis approach and accompanying optimization issues are illustrated using a fifth order continued fraction infinite impulse response filter. Figure
4.2(a) shows the control data flow graph for this filter. Figure 4.2(b) depicts the assignment and scheduling of the same computation structure for an architecture that consists of one multiplier, one adder, and six registers. The goal of the design-for-debugging step is to allocate minimal hardware resources that enable the cut state to be observed and controlled. The primary requirement of this design post-processing for debugging is to avoid changing the existing design allocation, assignment, and scheduling. To avoid adding new I/O ports, the cut should be scheduled for transfer at control steps when the states of input and output variables are not imported or exported (in this example, control steps 1 through 11). For controllability, during loading of the cut state, the value of each variable in the cut should be written before its first usage. For observability, during export of the cut, the state of each variable in the cut should be read before it is overwritten by the value of the same variable from the next computation iteration or by another variable. An integral part of any complete cut is the primary input and output of the system. A trivial candidate for a subset of variables which constitutes a complete computation cut is C = {D1, D2, D3, D4, D5} (dotted lines in Figure 4.2(b)). The state of the cut and the input completely define the state of a particular iteration of the depicted computation. Hence, a particular cut state, along with a correspondingly synchronized input sequence, can be used to restart the computation correctly on an execution engine. A possible set of control steps at which the cut state can be input is CS = {1, 2, 3, 4, 5}. Since all variables in C are concurrently alive, they must be stored in five different registers. The designer requires read/write access to these registers from the designated I/O pins. Since the cut is stored in five registers, five register-to-I/O connections have to be allocated to enable cut observability and controllability.
As a lower-overhead alternative, consider the cut consisting of the output variables of additions A2, A4, A6, A8, and A10 (bold lines in Figure 4.2(b)). Only one register (R5) is required to hold the values of these variables since they are not alive simultaneously. In this case only one register-to-port connection is dedicated to the register that holds the cut. Cut dispensing is performed in five consecutive control steps: 2, 3, 4, 5, and 6.
4.1.2
Computation and Hardware Model
Two main, often contradictory, criteria for evaluating system and behavioral synthesis models of computation are expressiveness [Edw97] and suitability for optimization. While high expressiveness implies a wider application domain, suitability for optimization often implies an efficient implementation. For the sake of conceptual simplicity, in this work we target the synchronous data flow (SDF) model of computation [Edw97]. This computation model is often used to facilitate optimization-intensive compilation for ASIC platforms (filtering, frequency transforms, wavelet computation structures, error correction coding, encryption, etc.) [Rab91]. Modern single-chip applications (for example, MPEG audio/video encoding/decoding or wireless communication protocols and data transfer) are typically not developed based only on the SDF computation model. However, most of the subfunctions (DCT and FFT transforms, Huffman coding, etc.) in such applications can be modeled using the SDF computation model. Computation that does not follow the SDF model can be abstracted using the semi-infinite stream random access machine (SISRAM) model. The SISRAM model is created by removing the requirement for algorithm termination from the standard RAM model [Aho83]. It is important to stress that the cut-based debugging approach is not limited
to a specific computation model. However, for each computation model, a cut definition has to be established to satisfy the generic concept of a cut: a cut at time T is defined as a subset of variables from which any other variable computed after T can be computed. In this manuscript, we describe cut selection for ASICs that are synthesized from control data flow graphs (CDFGs). This simplification is assumed for three reasons: brevity, availability of synthesis tools, and the fact that the SDF computation model corresponds to many data-intensive multimedia, communications, and wireless applications. In our experiments, we used Silage [Rab91] as the specification language for the ASIC implementation. We assume fully deterministic behavior of the hardware and a continuous semi-infinite operation mode (not necessarily periodic). We do not impose any restriction on the interconnect scheme of the assumed hardware model at the register-transfer level. Registers may or may not be grouped into register files. Each hardware resource can be connected in an arbitrary way to other hardware resources. We do not impose any restrictions on the number of pipeline stages of the employed functional units. The design is fully specified, and its functionality and realization are not disturbed by the debugging process, with the exception of enabling the user to write into specific controllable registers. In a fully specified design, each operation, variable, and data transfer is scheduled and assigned to a particular instance of a hardware resource in one or more control steps. In order to support debugging, we allocate additional debugging hardware to satisfy all debugging requirements. The goal is, of course, to add as little hardware as possible. In particular, we do not allow an increase in the number of I/O pins, since this is a constraint that dominates other hardware constraints in modern designs.
4.2
The New Approach: Cut-Based Integrated Debugging
The key idea behind the cut-based approach for the integration of simulation and emulation is to leverage the strong aspects of each of the functional execution domains. The resulting debugging technique provides fast, observable, and controllable functional execution. The essence of the idea is the establishment of the concept of a complete cut. A complete cut of a computation is a subset of variables sufficient to correctly continue the computation regardless of the values of variables that are not part of the cut, i.e., all variables in the computation that are not part of the cut can be recomputed using only the cut variables. The full system state is a straightforward example of a complete cut. However, a complete cut may be substantially smaller than the system state. Clearly, if one has complete controllability and observability over the state of all variables in the complete cut at a specific breakpoint, the computation can be continued functionally correctly from that breakpoint. A cut contains the complete information about the history of the computation process and its primary inputs up to a given point in time (breakpoint). For the sake of brevity, from now on when we say cut, we mean a complete cut. Our cut-based functional debugging approach is conducted using the following three phases of the design and debugging process:

• Design post-processing.

• Phase 1 - Defining the cut from the RT-level design specification.

• Phase 2 - Augmentation of the design specification with cut statements which support controllability and observability when the design is executed in a debug mode; and
• Debugging process.

• Phase 3 - Simultaneous and coordinated design execution of the fabricated or emulated design and the appropriate simulator for efficient debugging.

In the first phase, a computation iteration at the behavioral level of specification is logically partitioned into two or more components such that the cut between the partitions is complete. The synthesis support for the exchange of information between simulation and emulation has the following three degrees of design freedom:

• the determination of the variables which form the cut;

• the determination of the exact control step when the state of a particular variable is read or replaced by a user-specified state;

• the assignment of specific sets of I/O pins used to transfer variable states to or from the chip.

It is important to notice the optimization trade-off involved in finding the optimal cut. The optimization procedure has a unique goal: to add minimum hardware resources to the initial design while obtaining its full controllability and observability. A cut with a minimum number of variables seems to be an attractive solution. However, if those variables are simultaneously alive, more of the registers that hold the variables of the cut-set have to be provided with register-to-I/O-pin interconnects. Therefore, a favorable cut is one which consists of variables with long, disjoint life-times and has the property that a small number of machine registers contain the cut variables.
Once an optimal cut is found, the next design problem is to define the sequence of control steps in which the variables are dispensed out of the chip. The freedom in transferring the cut state is limited by the control steps at which the I/O pins are busy. Due to the lack of idle cycles, in some cases not all variables of the cut can be transferred over the I/O pins of the emulator. One straightforward solution to this problem is to allocate a near-minimal buffer to hold the unscheduled variables. Potkonjak, Dey, and Wakabayashi have utilized the concepts of pipelining debugging variables, to improve their scheduling and assignment freedom, and of using I/O buffers, to improve the resource utilization of I/O pins [Pot95]. They do not search for a complete cut of a computation. Instead, they derive provably optimal bounds on the maximum cardinality of the set of controllable and observable variables for a given design specification. Most importantly, they have developed a non-greedy heuristic minimization algorithm for I/O buffer allocation, which can be successfully used in the cut-based debugging framework. In the second phase of the synthesis approach, the original design specification is augmented with additional resources that enable design observability and controllability. For example, the following input operation is incorporated to provide complete controllability of variable Var1 using the user-specified input variable Input1: if (DEBUG) then Var1 = Input1;. In the case of pipelined functional units, their pipeline latches are not subject to inclusion into cuts. However, for programmable platforms with states inaccessible by instructions (pipelines, branch predictors) and memory hierarchies, outputting the machine cut state becomes a lengthy process. The hardware/software co-design platform presented in this work encapsulates a complex environment with a set of programmable cores, a number of ASIC accelerators, and a memory hierarchy.
4.3
Synthesis for Debugging
In this section, we overview the key optimization problems involved in integrating debugging resources into a design specification for full controllability and observability. We present a set of techniques that add minimal hardware resources to a given design specification in order to achieve the design-for-debugging objectives. The determination and integration of inserted debugging resources is performed by the following sequence of tasks. First, the optimal cut is selected based on the analysis of the computation control data flow graph (CDFG). The goal is to identify a subset of variables that represents a CDFG cut, such that all variables in the set are stored in a minimal number of registers. In addition, the computation graph and timing bounds have to allow all variables in the cut-set to be output from the chip through a designated set of I/O pins within a single computation iteration or within multiple iterations. This task is explained in detail in Subsection 4.3.2. Subsection 4.3.3 describes the algorithm that searches for an optimal schedule of cut-set variables with respect to the control steps at which a subset of the available I/O ports is idle. Finally, the algorithm presented in Subsection 4.3.4 finds the minimal-cardinality set of register-to-port interconnects that enables scheduling the cut-set variables to the available ports. After the cut-set variables are assigned and scheduled, the initial specification is updated with the set of resources that enable cut-set I/O. The chip is then ready to be fabricated or emulated.
4.3.1
Background Definitions
Before we present the formal description of the encountered problems and developed algorithms, we introduce a set of definitions that build the formal foundation
for our debugging methodology. A CDFG of a computation iteration i is a directed graph G(N, PI, POUT, D, E) with four types of vertices: data operations N, primary inputs PI, primary outputs POUT, and state delays D; and data precedence edges E. Each data precedence edge has a single source and a single sink vertex. Primary inputs can be used only as sources of edges. A primary output can be used only as the sink of one edge. Each data operation N_i has at least one incoming and at least one outgoing data precedence edge. All edges with a common source N_i represent the variable V_i generated by data operation N_i. Each operation N_i ∈ N is labeled with an integer L_i that specifies the number of control steps required to execute operation N_i. State delays are used to distinguish the computation state between two consecutive iterations. Each state delay D_i can be the sink of only one edge. The assumed CDFG definition can easily be extended with the following two types of control edges: (i) weighted edges can represent control precedence information (for example, if two operations are connected with a control precedence edge weighted W, then the execution of the source operation trails the execution of the sink operation by W control steps) [Rab91]; (ii) control edges can be used to create loop and if-then-else macro constructs, as presented in [Edw97]. Since the design-for-debugging process is performed after RT-level synthesis, and thus after operation scheduling, the control edges of type (i) do not impose constraints on the presented debugging methodology. The selected cuts partition all trajectories in the control flow induced by edges of type (ii). According to the allocated resources and the data and control dependencies, during the behavioral design process the CDFG is scheduled and assigned to the allocated hardware resources such that it can be executed in a particular number of control steps. The lower bound on the number of control steps required for
execution on |N| functional units is equivalent to the critical path of the CDFG, i.e., the largest sum of operation labels along a path from a state delay in computation iteration i to a state delay in the next computation iteration i + 1.

Definition 1. A schedule of variable V_i in a CDFG is determined by the control step C_i^start when V_i is created and the control steps C_i^first, C_i^last when V_i is used for the first and last times, respectively.

Definition 2. A register assignment of variable V_i in a CDFG is an m-to-1 mapping V_i → R_j to a register R_j from the set R of all registers in the ASIC.

Definition 3. The read life-time of variable V_i stored in register R begins at the control step when V_i is created and lasts until the control step when V_i is overwritten by another variable V_j or by the next iteration's value of V_i.

Definition 4. The write life-time of variable V_i stored in register R starts at the control step when V_i is computed and ends at the control step when V_i is used for the first time.

Definition 5. A port is a set of K I/O pins. When variable V is assigned to port P, V is output or input in its entirety through port P in one control step.

An example of a scheduled and assigned CDFG, together with the accompanying definitions, is depicted in Figure 4.3. The registers that store the variables are R1, R2, and R3. The exact clock cycles at which operations are executed are also depicted. For example, the last control step at which the variable stored in R1 is used is C1. Since no variable is stored into R1 after C1 until C3, the read life-time of the variable stored in R1 is said to span the entire iteration. The write life-time of a variable can be observed in the example of the variable stored in R2. This variable is computed at control step C1 and used for the first time in the next consecutive
control step. Hence, its write life-time includes only the control step C1. There are no restrictions imposed on the type of data operations. Since we target debugging designs at the behavioral level, we consider operations such as addition, subtraction, multiplication, and division.

[Figure 4.3 contents: a CDFG with primary input IN, state delays D1–D3, data operations N1 (+), N2 (*), N3 (+), and N4 (*), registers R1–R3, control steps C1–C3, and primary output OUT. The variable V1 stored in R2 is created at control step C1 and used for the first time in C2, so its write life-time spans only C1; the bold edges represent an example of a complete computation cut.]
Figure 4.3: An example of a scheduled and assigned control data flow graph and the accompanying definitions. Primary inputs and outputs, state delays, data operations, data precedence edges, register assignment, variable write life-time, and a complete cut example are illustrated.
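The following small sketch, with hypothetical data structures, restates Definitions 1-4 operationally: from a variable's schedule and register assignment it derives the read and write life-times that the cut-selection and scheduling algorithms below rely on (here the write life-time is the half-open range from creation to first use, matching the Figure 4.3 example, and the read life-time is taken analogously as the range from creation to overwrite).

def life_times(variables):
    """variables: dict name -> schedule information (Definitions 1 and 2):
       'created', 'first_use', 'overwritten' are control steps; 'register' is the assignment."""
    result = {}
    for name, v in variables.items():
        result[name] = {
            # Definition 4: write life-time spans creation up to (not including) first use.
            "write": list(range(v["created"], v["first_use"])),
            # Definition 3: read life-time spans creation until the register is overwritten.
            "read": list(range(v["created"], v["overwritten"])),
            "register": v["register"],
        }
    return result

# Hypothetical variable resembling V1 of Figure 4.3: created in C1, first used in C2,
# overwritten in C3 -> write life-time {C1}, read life-time {C1, C2}.
print(life_times({"V1": {"created": 1, "first_use": 2, "overwritten": 3, "register": "R2"}}))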
4.3.2
Cut Selection
In this subsection, we introduce two definitions of a cut of an SDF computation and present effective algorithms for cut selection and allocation of hardware resources that enable I/O of the cut state. The two different cut definitions enable exploration of certain trade-offs in the cut-selection process. The first definition of a cut imposes a limitation that all contained variables must be selected from a single computation iteration. The second definition relaxes this requirement
by enabling the search for a cut-set to be conducted among variables in several consecutive computation iterations. While cuts which obey the first definition require smaller trace-capturing devices and induce lower computation initiation start-up times, the cuts formed according to the second definition frequently require fewer hardware resources.

Definition 1. Single Iteration Complete Cut. A complete cut is a set of variables generated within one computation iteration that cuts all possible paths in the computation.

Therefore, the goal of the cut-set search algorithm is, given a computation control data flow graph, to find a register subset of minimal cardinality that stores all the variables of at least one complete cut. Before we commence with the algorithm description, note that only observability-related algorithms will be presented. The algorithms that support controllability are identical, with the exception that write life-times are used in place of read life-times. If the designer desires to use the same cut for both observing and controlling the computation, then the cut should be determined by the controllability version of the proposed algorithms. The initial problem formulation is given in the standard Garey-Johnson format [Gar79]:

PROBLEM: Optimal Cut-Set for Debugging.
INSTANCE: Given a control data flow graph with read life-times of its variables, variable-to-register assignments, P ports, a set S of control steps when each port is busy, and an integer K.
QUESTION: Is there a subset of variables V such that each path in the control data flow graph CDFG contains at least one variable V_i ∈ V, the cardinality of the set of registers that contains each variable V_j ∈ V equals K, and there exists a schedule such that each variable V_j ∈ V can be output through the P ports at control
steps not included in S?

The NP-completeness of this problem can be proven by restriction to the FEEDBACK ARC SET problem (GT8, pp. 192, [Gar79]). The restriction is made by assuming that no hardware sharing is possible, i.e., each variable V_i in the CDFG is stored in a separate register R_i. Since even the problem of proving whether an arbitrary set of variables represents a CDFG cut is of polynomial (linear) complexity, we transform the problem into a computationally less demanding task and apply problem partitioning and most-constrained least-constraining heuristics as the fundamental approach to searching the transformed solution space. The pseudo-code of the developed algorithm is presented in Figure 4.4.

InputSensitiveGraph ISG = Construct_ISG(CDFG).
ISG = Input_Sensitive_Transitive_Closure(ISG, CDFG).
Repeat
    D = Input_Sensitive_Dominating_Set(ISG).
until Schedule(D, ListOfPorts) != EXISTS.
Figure 4.4: Pseudo-code for the cut search algorithm.

The developed algorithm constructs the solution based on the analysis of the input-sensitive graph (ISG) representation of the original CDFG. The ISG is built from the CDFG according to the pseudo-code in Figure 4.5. The idea behind this transformation is to create a graph-like structure which enables a fast check of whether a subset of variables is a cut. Each node in the ISG represents an operation from the CDFG, contains a single output that represents the output variable of that operation, and contains a number of inputs which represent the operands. A cut of a computation is a selection of node outputs which covers all inputs of all nodes. An important step in the algorithm is the input-sensitive transitive closure operation, which builds the dependencies between operations, i.e., nodes
in the ISG. This operation is described by the pseudo-code in Figure 4.6. An example ISG, which corresponds to the CDFG illustrated in Figure 4.3, is shown in Figure 4.7. The dotted edges are added while applying the input-sensitive transitive closure procedure.

For each node N_i ∈ CDFG
    Create a node M_i ∈ ISG.
For each edge E(N_i, N_j) directed from N_i to N_j
    Create an input port M_{j,m} for M_j, where m is the index of the input port.
For each edge E(N_i, N_j) directed from N_i to N_j
    Create an edge E(M_i^o, M_{j,m}) connecting M_i^o and M_{j,m}, where M_i^o is the output port of M_i and M_{j,m} is the m-th input of M_j.
Comment: Each primary input P_i of the CDFG is ignored.
Schedule and assign each ISG node (variable) as its parent CDFG node (variable).
Figure 4.5: Construct_ISG(CDFG) - pseudo-code for the construction of the input-sensitive graph.

For each pair of edges E(M_a^o, M_{b,m}), E(M_b^o, M_{c,m}) ∈ ISG such that E(M_a^o, M_{b,m}) connects M_a to M_b and E(M_b^o, M_{c,m}) connects M_b to M_c
    If the difference in control steps between the starts of the read life-times of nodes (variables) a and c is less than or equal to the total number of control steps in one iteration
        Insert an edge E(M_a^o, M_{c,m}) connecting M_a and M_{c,m}.
Figure 4.6: Input_Sensitive_Transitive_Closure(ISG, CDFG) - pseudo-code for computing the input-sensitive closure of the CDFG.

Using the definition of a scheduled and assigned input-sensitive graph, the initial problem can be reformulated in the following standard Garey-Johnson format:
PROBLEM: Optimal Input Dominating Set of an Input Sensitive Graph.
INSTANCE: An input-sensitive graph ISG with read life-times of its variables, variable-to-register assignments, P ports, an associated set S of control steps when each port is busy, and an integer K.
QUESTION: Is there an input dominating set of variables V such that each input is covered by at least one variable in V, the cardinality of the set of registers that contains all variables from V equals K, and there exists a schedule such that all variables in V can be output through the P ports at control steps not included in S?

[Figure 4.7 contents: the input-sensitive graph corresponding to the CDFG of Figure 4.3, with nodes N1–N4 whose inputs are the data operation operands; solid edges are the original data precedence edges inherited from the CDFG, and dotted edges are added by the input-sensitive transitive closure of the CDFG.]
Figure 4.7: Example of an input-sensitive graph which corresponds to the CDFG shown in Figure 4.3. Each node corresponds to a data operation N_i in the original CDFG and has a set of inputs which correspond to the operands of N_i. The edges in the graph are either inherited from the original CDFG or created by the input-sensitive transitive closure procedure.
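A minimal sketch of the ISG idea follows, with hypothetical data structures (the real implementation also carries schedules and register assignments): each CDFG operation becomes a node whose inputs are its operands, the input-sensitive closure may add further admissible drivers to an input, and a candidate selection of node outputs is a cut exactly when it covers every input of every node.

def build_isg(cdfg):
    """cdfg: dict operation -> list of operations feeding it (primary inputs omitted).
    Each input 'port' starts with a single admissible driver: the original operand."""
    return {op: [[src] for src in srcs] for op, srcs in cdfg.items()}

def add_closure_edges(isg, cdfg, close_enough):
    """Input-sensitive closure (one step, as in Figure 4.6): if a feeds b and b feeds c,
    and a's and c's schedules are close enough, a may also cover c's corresponding input.
    close_enough is a caller-supplied predicate standing in for the life-time test."""
    for b, b_srcs in cdfg.items():
        for a in b_srcs:
            for c, c_srcs in cdfg.items():
                for m, src in enumerate(c_srcs):
                    if src == b and close_enough(a, c) and a not in isg[c][m]:
                        isg[c][m].append(a)

def is_cut(isg, selected):
    """A selection of node outputs is a cut iff every input of every node is covered."""
    return all(any(d in selected for d in port) for ports in isg.values() for port in ports)

# Hypothetical four-operation CDFG: N2 and N3 consume N1; N4 consumes N2 and N3.
cdfg = {"N1": [], "N2": ["N1"], "N3": ["N1"], "N4": ["N2", "N3"]}
isg = build_isg(cdfg)
print(is_cut(isg, {"N1", "N2", "N3"}))   # True: all node inputs are covered
print(is_cut(isg, {"N2", "N3"}))         # False: the inputs driven by N1 stay uncovered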
The NP-completeness of this problem can be proved by restriction to the GRAPH DOMINATING SET problem (GT2, pp. 190, [Gar79]). The restriction simplifies the ISG in such a way that each node has only one input and each variable V_i ∈ ISG is stored in a dedicated register R_i. We developed a novel heuristic algorithm for this problem. The pseudo-code of the proposed heuristic technique is presented in Figure 4.8. The algorithm generates a large number of candidate node subsets which have the property of being cuts. Although the cuts are generated probabilistically uniformly with respect to the search time, the most-constrained least-constraining objective function that evaluates the candidate cut-sets is run-time dependent. In the beginning, the algorithm favors most-constrained solutions with few storing registers. As the search progresses, the dominating cost factor becomes the cumulative length of the read life-times of all selected variables (such variables are least-constraining); at that point, variables with non-overlapping read life-times are favored. This approach provides more assignment freedom for the final variable-to-port scheduling.

The main disadvantage of defining a cut according to the first definition is that it lacks the flexibility of outputting a computation cut over a consecutive span of computation iterations. The following way of selecting a computation cut enables this property and reduces the amount of hardware resources added for cut I/O.

Definition 2. Multiple Iteration Complete Cut. A complete cut is a subset of variables which bisects all cyclic paths in the control data flow graph of a computation.

The targeted optimization problem - finding a cut which is contained in a minimal number of registers, together with a schedule according to which all variables of the cut can be output - can be defined in the standard Garey-Johnson format:
Preprocessing:
    CUT_base = {V_i^output ∈ ISG} (output variables)
    CUT_base ∪= variables read-alive during the entire iteration.
Repeat
    Select a random subset of nodes CUT ⊆ ISG, with CUT_base ⊆ CUT, such that CUT covers all node inputs in ISG.
    Set best solution CUT* = CUT.
until there does not exist a schedule such that CUT can be output through P ports at control steps not included in S.
Repeat GLOBAL times
    Unselect a random subset of nodes ∈ CUT such that at least one node input remains uncovered.
    Randomly select a subset of nodes subCUT from (ISG - CUT) which covers the uncovered set of inputs.
    Merge subCUT and CUT.
    If Cost(CUT) < Cost(CUT*)
        If there exists a schedule such that CUT can be output through P ports
            set CUT* = CUT.
    Repeat LOCAL times
        CUT+ = CUT*.
        Unselect a random subset of nodes ∈ CUT+ such that at least one node input remains uncovered.
        Randomly select a subset of nodes subCUT from (ISG - CUT+) which covers the uncovered set of inputs.
        Merge subCUT and CUT+.
        If Cost(CUT+) < Cost(CUT*)
            If there exists a schedule such that CUT+ can be output through P ports
                set CUT* = CUT+.
Return: CUT* as the resulting input dominating set.
Figure 4.8: Input_Sensitive_Dominating_Set(ISG).
PROBLEM: Optimal Cut-set for Debugging (II).
INSTANCE: Given a control data flow graph with read life-times of its variables, variable-to-register assignments, P ports, an associated set S of control steps when each port is busy, and an integer K.
QUESTION: Is there a subset of variables V which, when removed from the CDFG, leaves no directed cycles in the CDFG, such that the cardinality of the set of registers that contains all variables from V equals K and there exists a schedule such that each variable V_j ∈ V can be output through the P ports at control steps not included in S?

The specified problem is NP-complete, since there is a one-to-one mapping between the special case of this problem in which all operations in the computation are executed exactly the same number of times and the FEEDBACK ARC SET problem (GT8, pp. 192, [Gar79]). The developed heuristic algorithm for this problem is summarized by the pseudo-code in Figure 4.9. The heuristic starts by logically partitioning the graph into a set of strongly connected components (SCCs) using the depth-first search algorithm [Aho77]. This algorithm has complexity O(V + E), where V is the number of vertices and E is the number of edges in the graph. All trivial SCCs, which contain exactly one vertex, are deleted from the resulting set since they do not form cycles. Then, the algorithm iteratively performs several processing steps on each of the non-trivial SCCs. At the beginning of each iteration, to reduce the solution search space, a graph compaction step is performed (pseudo-code presented in Figure 4.10). In this step, each path P: A → B that contains only vertices V ∈ P, V ≠ A, with exactly one variable input is replaced with a new edge E(A,B) which connects the source A and destination B and represents an arbitrarily selected edge (variable) of the same path.
Create a set SCC = ComputeScc(CDFG(V, E)) of strongly connected components [Aho83]
For each SCC_i ∈ SCC
    If |SCC_i| = 1, delete SCC_i from SCC
Repeat LOOPS times
    Repeat
        CUT = null
        While SCC != empty
            For each SCC_i ∈ SCC
                GraphCompaction(SCC_i)
                For each node V_{i,j}
                    S = ComputeScc(SCC_i − V_{i,j})
                    OF(S) = (1 + α) · ( Σ_{i=1}^{|S|} |S_i| · Edges(S_i) · LifeTime(S_i) ) · Registers(S)^4,
                    where α is a random number, α ∈ {0, 1/|SCC|^2}, LifeTime(S_i) returns
                    the read life-time of the variables in S_i, and Registers(S) returns the number of registers which store all variables in S
                Select the vertex V_{i,j} which results in the minimal OF(S(V_{i,j}))
                Delete V_{i,j} from SCC_i
                SCC = S(V_{i,j})
                For each SCC_i ∈ SCC
                    If |SCC_i| = 1, delete SCC_i from SCC
            End For
            CUT = CUT ∪ V_{i,j}
    until Schedule(D, ListOfPorts) != EXISTS.
    If |CUT| < |BESTCUT| then BESTCUT = CUT
Return BESTCUT
Figure 4.9: Pseudo-code for Optimal Cut-set for Debugging (II) search.
For each edge, a list of registers that store the compacted variables is maintained.

For each vertex V_i ∈ SCC_i
    If V_i has exactly one input edge E(j,i) with its source in vertex V_j
        For each edge E(i,k)
            Create edge E(j,k) and delete E(i,k)
        Delete E(j,i) and V_i
Figure 4.10: Pseudo-code for graph compaction.

In the next step, an objective function decides which node (variable) in the current set of SCCs is to be deleted. For the deletion of each vertex, the function analyzes the cardinality of the newly created set of SCCs, the registers that store the variables in these SCCs, the lengths of the read life-times of the variables in the SCCs, and the vertex cardinalities of the new set of SCCs. The vertex that results in the smallest objective function value is deleted from the set of nodes, along with all adjacent edges. The deleted vertex is added to the resulting cut-set. The process of graph compaction, candidate node deletion evaluation, node deletion, and graph updating is repeated while the set of non-trivial SCCs in the graph is not empty. The set of nodes (variables) deleted from the computation represents the final cut-set selection. Consider the example shown in Figure 4.12. The CDFG of the third order Gray-Markel ladder IIR filter, shown in Figure 4.11, has only one non-trivial SCC. The graph compaction step is explained using Figures 4.12(a,b,c). Initially, vertex B is merged with vertex A, which forces variable W to be merged with variable V. Next, the shaded nodes in Figure 4.12(b) are merged, as are the corresponding variables. The node compaction process results in the SCC presented in Figure 4.12(c). Figure 4.12(e) illustrates the resulting set of SCCs after node M is deleted from the SCC depicted in Figure 4.12(d).
[Figure 4.11 contents: (a) the control data flow graph of the third order Gray-Markel ladder filter, composed of additions A1–A12, multiplications C1–C7, the primary input IN, the primary output OUT, and state delays D1–D3; (b) the same computation assigned, allocated, and scheduled on registers R1–R6 over control steps 0–12, with the variables of the first-definition cut drawn as bold dotted lines.]
Figure 4.11: The unscheduled (a) and scheduled and assigned (b) control data flow graph of a third order Gray-Markel ladder filter.

Finally, let us compare the cuts retrieved according to the two different definitions on the example of the Gray-Markel ladder filter. In Figure 4.11, the variables of a cut-set that corresponds to the first definition are represented as bold dotted lines. In order to define a single iteration, the output variables of adders A6, A8, and A9, stored in registers R4, R2, and R3, respectively, are output as a cut. Such a set of variables fully determines the state of the machine. Consider now the cut that corresponds to the second definition. It contains three variables: the outputs of adders A1, A3, and A5, all stored in register R1. This subset of variables bisects all cyclic paths in the CDFG; by deleting the edges in the CDFG which represent these variables, all cyclic paths are removed. In order to use these variables in restarting verification, cut values from three consecutive iterations are required before the machine state is correctly restored.
[Figure 4.12 contents: subfigures (a)–(c) show successive node mergers of the single non-trivial SCC (with vertices A and B and variables V and W); in subfigure (d) the node considered for deletion is marked, and in subfigure (e) bold edges and nodes represent the SCCs that remain after the deletion.]
Figure 4.12: Finding the cut-set of the third order Gray-Markel ladder IIR filter. Subfigures (a,b,c) demonstrate the node merger procedure. Subfigures (d,e) illustrate the removal of a node from the set of SCCs and its inclusion in the set of selected cut variables.
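The skeleton of the SCC-based search of Figures 4.9 and 4.10 can be sketched as follows (hypothetical interfaces; a single pass with a deliberately simplified objective that weighs only the sizes of the remaining SCCs and the number of registers, rather than the full randomized objective function OF(S), and without the graph compaction step).

def sccs(graph):
    """Kosaraju's algorithm; graph: dict vertex -> set of successors."""
    order, seen = [], set()
    def visit(v):
        seen.add(v)
        for w in graph[v]:
            if w not in seen:
                visit(w)
        order.append(v)
    for v in graph:
        if v not in seen:
            visit(v)
    rev = {v: set() for v in graph}
    for v, ws in graph.items():
        for w in ws:
            rev[w].add(v)
    comps, assigned = [], set()
    for v in reversed(order):
        if v in assigned:
            continue
        stack, comp = [v], set()
        while stack:
            u = stack.pop()
            if u not in assigned:
                assigned.add(u)
                comp.add(u)
                stack.extend(rev[u] - assigned)
        comps.append(comp)
    return comps

def multiple_iteration_cut(graph, register_of):
    """Delete, one vertex at a time, the vertex whose removal best breaks the remaining
    non-trivial SCCs; the deleted vertices (variables) form the multiple-iteration cut."""
    g, cut = {v: set(ws) for v, ws in graph.items()}, set()
    while True:
        nontrivial = [c for c in sccs(g) if len(c) > 1 or any(v in g[v] for v in c)]
        if not nontrivial:
            return cut
        best, best_score = None, None
        for v in {u for c in nontrivial for u in c}:
            rest = {u: g[u] - {v} for u in g if u != v}
            score = (sum(len(c) for c in sccs(rest) if len(c) > 1),   # leftover SCC mass
                     len({register_of[u] for u in cut | {v}}))        # registers used so far
            if best_score is None or score < best_score:
                best, best_score = v, score
        cut.add(best)
        g = {u: g[u] - {best} for u in g if u != best}

# Hypothetical coupled loops a->b->c->a and c->d->c; deleting c bisects both cycles.
g = {"a": {"b"}, "b": {"c"}, "c": {"a", "d"}, "d": {"c"}}
print(multiple_iteration_cut(g, {"a": "R1", "b": "R1", "c": "R2", "d": "R3"}))   # {'c'}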
4.3.3
Variable Scheduling
The design has to be able to output all variables in the cut-set through a limited number of I/O ports within a single or multiple computation iterations. Since the procedure which checks whether this is achievable is invoked every time a candidate cut-set is found, we propose a most-constrained least-constraining heuristic technique to quickly provide an answer to this question. If the answer is positive, a search for the minimal cardinality set of register-to-port interconnects is performed. The interconnects are such that cut-set variables can be output through
the ports at idle control steps. In this subsection, we present the algorithm for the first subproblem. The problem can be formulated in the following format:

PROBLEM: Output Scheduling of a Set of Variables in a CDFG.
INSTANCE: A set of variables V, each with its read life-time, P ports, and an associated set S of control steps when each port is busy.
QUESTION: Is there a schedule such that all variables can be output through the P ports at control steps not included in S?

The NP-completeness of this problem is proved by restriction to the SEQUENCING WITH RELEASE TIMES AND DEADLINES problem (SS1, pp. 236, [Gar79]). The restriction is imposed by selecting only those variables in V that are not in the set of state (delay) variables D. The heuristic developed for this problem schedules variables using a greedy strategy. First, the constraint of each control step C_i in the scheduled and assigned CDFG is calculated as the number C_i^var of variables that are read-alive during that control step. For each variable V_i, its constraint is computed as the sum of the C_j^var, where j ranges over the control steps at which V_i is read-alive. The most-constrained variable is then assigned to the least-constraining control step. The process of computing constraints and scheduling variables to distinct control steps is iterated until all variables in the cut-set are output scheduled. Pseudo-code of the proposed most-constrained least-constraining heuristic is presented in Figure 4.13. The algorithm is illustrated using Figure 4.14. There are 12 variables in the cut-set and three output ports, all of which are busy during control step C4. The schedule is found using the described heuristic by sequentially assigning variables to ports as depicted. First, variables V1, V8, and V12 are scheduled to C5, and V11 to C3, since they do not have a choice. In the next step, since all ports are used at C5, we schedule V9 to C1 and V6 to C3. Next, variable V10 is the most constrained according to the formula in the pseudo-code, so we schedule it to its least-constraining control step C2. Then V7 and V5 have no choice and have to be scheduled at C2. Finally, V2, V3, and V4 are scheduled to C1 and C2 according to the principles already described.
Repeat until no additional variable can be scheduled
    For each variable V_i
        If V_i can be scheduled only in one control step C_j, schedule V_i to C_j and to any port P_x available at C_j.
    For each control step C_i
        Set C_i^var to the number of all variables which are read-alive at C_i.
    Schedule the variable V_i with maximal
        Cost(V_i) = ( Σ_{j ∈ AllControlSteps} C_j^var · ReadAlive(V_i, C_j) ) / ( ReadLifeTime(V_i) − 1 )
    to the control step C_k which has minimal C_k^var and to any port P_x still available at C_k.
ReadAlive(V_i, C_j) returns 1 if V_i is read-alive at C_j. ReadLifeTime(V_i) is the number of control steps for which V_i is read-alive.
Figure 4.13: Pseudo-code for the cut-set output scheduling heuristic.
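A loose Python rendering of the heuristic of Figure 4.13 follows, under simplifying assumptions (hypothetical data structures; ports are tracked only as a per-step capacity rather than as named ports, and the forced assignments of the pseudo-code's first step are folded into the same loop); it is a sketch, not the dissertation's implementation.

def schedule_cut_output(read_alive, ports_busy, num_ports, steps):
    """read_alive: dict variable -> set of control steps at which it is read-alive.
    ports_busy: dict step -> ports already used by primary I/O. Returns var -> step or None."""
    free = {c: num_ports - ports_busy.get(c, 0) for c in steps}
    schedule, remaining = {}, dict(read_alive)
    while remaining:
        # Constraint of a control step: number of remaining variables read-alive there.
        c_var = {c: sum(c in alive for alive in remaining.values()) for c in steps}
        options = {v: [c for c in alive if free[c] > 0] for v, alive in remaining.items()}
        if any(not opts for opts in options.values()):
            return None                                    # some variable cannot be output
        forced = [v for v, opts in options.items() if len(opts) == 1]
        if forced:
            v = forced[0]                                  # no choice: schedule it right away
        else:
            # Most-constrained variable: largest normalized sum of step constraints.
            v = max(remaining, key=lambda u: sum(c_var[c] for c in remaining[u])
                                             / max(len(remaining[u]) - 1, 1))
        step = min(options[v], key=lambda c: c_var[c])     # least-constraining control step
        schedule[v], free[step] = step, free[step] - 1
        del remaining[v]
    return schedule

# Hypothetical instance: three variables, two ports, one port slot already busy at step 2.
print(schedule_cut_output({"V1": {1, 2}, "V2": {2}, "V3": {1, 2, 3}},
                          ports_busy={2: 1}, num_ports=2, steps=[1, 2, 3]))
# V2 has no choice at step 2 and is placed first; the others take remaining free slots.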
4.3.4
Variable-to-Port Scheduling
The second phase of the synthesis-for-debugging process takes as input the cut-set selected in the first phase, the read life-time of each cut-set variable, and the register in which it is stored. The number of available output ports is also given. The key design question is how to assign each cut-set variable to a specific output port in such a way that the number of connections between registers and I/O ports is minimal. We present the problem in the standard Garey-Johnson format:
[Figure 4.14 contents: twelve cut-set variables V1–V12 with read life-times spanning control steps C1–C5, and three output ports, all of which are busy at control step C4; the numbers next to the variables indicate the order of decisions made by the sequencing algorithm.]
Figure 4.14: Example of output scheduling.

PROBLEM: Optimal Output Scheduling of a Set of Variables in a CDFG for Debugging.
INSTANCE: An ordered set of variables V, each with its read life-time and designated register, P ports, an associated set S of control steps when each port is busy, and an integer K.
QUESTION: Is there a schedule such that all variables can be output from the chip through the P ports at control steps not included in S, and the cardinality of the set of register-port connections equals K?

If the optimization demand in the problem is ignored, the proof that this problem is NP-complete is equivalent to the NP-completeness proof for the scheduling problem described in the previous subsection. We have partitioned this problem into two fully modular optimization subproblems. First, register-to-port assignment is performed such that the optimization requirements are met and the variables of the proposed cut-set can be output through all designated ports. In the second phase, for each port P_i, all variables assigned to P_i are scheduled for output. The pseudo-code for the combined algorithm is presented in Figure 4.15.
Interconnects = |R|
Repeat LOOPS times
    Repeat Interconnects times
        Connect register R with max(ProbabilisticCost(R)) to port P with max(ProbabilisticCost(P))
    End Repeat
    If BipartiteMatching(variables, ports, interconnects) == EXISTS
        Decrease(Interconnects)
End Repeat
Figure 4.15: Pseudo-code for the variable-to-port scheduling heuristic.

For the assignment problem, the developed heuristic iteratively tries to assign as few registers as possible, pairing the most-constrained registers with the least-constraining ports, such that the final variable-to-port scheduling is achievable. The objective function that quantifies the constraint of register R is given as

Cost(R) = Σ_{V_i ∈ R} ( Σ_{j ∈ AllControlSteps} C_j^var ) / ( ReadLifeTime(V_i) − 1 ),

where ReadLifeTime(V_i) returns the number of control steps for which V_i is read-alive. Similarly, the objective function that quantifies the constraint of port P_x is

Cost(P_x) = ( Σ_{j ∈ AllVariables} ReadLifeTime(V_j) · Assigned(V_j, P_x) ) / ( 1 + NumberOfRegistersAlreadyAssignedTo(P_x) ),

where Assigned(V_j, P_x) returns 1 if V_j is already assigned to P_x and 0 otherwise. In the conducted experiments, we used a probabilistic version of the described heuristic which iteratively generated randomized solutions. The solutions were still guided by the described objective functions, augmented with a certain random offset.
82
Once registers are assigned to ports, variable-to-port scheduling is performed only for variables stored in registers assigned to a single port. This problem is equivalent to the problem of maximum bipartite matching (pp. 601, [Cor90]) which can be efficiently solved in polynomial time. A bipartite graph G for each port P is constructed. The first partition of G contains a node for each control step in the computation iteration for which port P is not busy. A node in the other partition is created for each variable assigned to port P . Edges are drawn between each node which represents variable V and all nodes in the first control-step partition associated with control steps for which V is read-alive. To efficiently solve this problem we use the Ford-Fulkerson method which runs in O(V E), where V is the number of vertices and E is the number of edges in graph G [Cor90]. The presented algorithm assumes that the required number of I/O ports is sufficient to enable the I/O of the cut state. Since this number is often small and not known at design time, we perform an exhaustive binary search for the smallest number of I/O ports, which can satisfy the scheduling constraints. This search is performed as an outer shell to the scheduling and variable-to-port matching heuristic.
4.4
Experimental Results
In order to evaluate the developed debugging technique and accompanying algorithms, we have applied them on several benchmark designs. The examples were collected from the following technical manuscripts: 8th order continued fraction IIR filter, linear GE controller, Volterra filters, long echo canceler, wavelet filter, modem filter, Motorola C133 filter [Gue93, Pot95, Rab91], and a real-life avionics VTSOL controller [Huy93]. For all experiments, HYPER was used as a behav-
83
ioral compiler to obtain RTL implementations [Rab91]. Design-for-debugging analysis was performed to determine the cuts and the extra hardware overhead needed to support the proposed debugging technique. Tables 4.1 and 4.2, column 1, list the set of designs evaluated. For each design, an optimized (1st and 2nd rows) and a non-optimized version (3rd and 4th rows) were used. Additionally, on each version, both tight and more relaxed performance constraints were assumed. Optimized versions were obtained by applying scripts for speed optimizations [Hon97]. Available control step budgets, equal to the computation’s critical path and twice that amount, were used for the tight and more relaxed performance constraints, respectively. Tables 4.1 and 4.2, columns 3-7, describe the behavioral structure of the designs in the form of available control steps, critical path, number of variables, registers used in the RTL implementation, and computation states. Columns 8-10 display the structural properties of cuts obtained using the first (Table 4.1) and second (Table 4.2) cut definition respectively: the number of cut variables, number of registers used by the cut variables (each of which require register-toport connections), and additional ports needed. The experimental results point to the advantage of selecting computation cuts according to the second definition for designs which do not have large numbers of strongly connected components because of smaller cut-set cardinalities. This advantage comes at the expense of longer startup sequences during computation initialization. For example, in order to initialize correctly the computation for all design cases of the Motorola C-133 filter, its cut variable (according to the second cut definition) has to be input/output throughout 133 consecutive computation iterations. In the attempt, to input/output the cut in a single iteration 67 (or in the extremely constrained case 133) different variables have to be transferred through the I/O pins of the ASIC.
84
In 21 out of 80 cases, the available I/O ports were sufficient to support full observability and controllability. For example, the non-optimized modem filter shown in row 9 used 2 registers and required no extra ports to support its 12 variable cut. However, in several cases, extra ports were needed to obtain a design implementation, which was enabled for cut-based debugging. Extra ports were needed only for the highly optimized designs with exceptionally high sampling rates. In particular, the optimized wavelet filter and digital-to-analog converter have cuts of 15 and 76 variables, respectively. To support the required I/O of these variables, an additional 15 and 18 ports, respectively were needed, since existing ports were already fully utilized for I/O of the primary input and output. It is important to stress, that such design constraints are rarely imposed on the behavioral compilers.
4.5
Conclusion
The run-time of a design simulation results in several orders of magnitude slower functional execution with respect to emulation or fabrication. Design emulation and implementation significantly restrict design controllability and observability during functional debugging. We introduce a new cut-based functional debugging paradigm which integrates design emulation and simulation in such a way that advantages of both domains are fully utilized and result in a design approach which enables fast debugging with complete observability and controllability. We have identified the associated optimization synthesis tasks, established their computational complexity, and developed most-constrained least-constraining heuristics to solve them. The experimental results clearly indicate the power of the approach and new debugging tool on a number of real-life design examples with minimal hardware overhead.
85
Design Description
Structure
Complete Cut - Def.1
Hyper
Control
Critical
Vari-
Regi-
optim.
steps
path
ables
sters
States
Vari-
Regi-
Ports
ables
sters
added 0
8th Order
NO
18
18
35
19
8
8
1
Continued
NO
36
18
35
19
8
8
1
0
Fraction
YES
4
4
49
30
8
8
6
2
IIR Filter
YES
8
4
49
29
8
10
5
1
Linear GE
NO
12
12
48
19
5
8
3
0
Controller 1
NO
24
12
48
23
5
13
1
0
YES
6
6
48
27
5
5
5
1 0
YES
12
6
48
26
5
8
4
Wavelet
NO
16
16
31
20
15
15
1
0
Filter
NO
32
16
31
20
15
15
1
0
YES
1
1
31
31
15
15
15
15
YES
2
1
31
31
15
15
15
7
Modem
NO
10
10
33
16
8
12
2
0
Filter
NO
20
10
33
15
8
12
1
0
YES
4
4
47
29
8
8
6
2
YES
8
4
47
27
8
11
4
1
Volterra
NO
12
12
28
15
4
4
1
0
2nd
NO
12
24
28
15
4
4
1
0
order
YES
6
6
28
19
4
4
1
0
filter
YES
6
12
28
17
4
4
1
0
Volterra
NO
20
20
50
22
6
6
1
0
3rd order
NO
20
40
50
22
6
6
1
0
nonlinear
YES
8
8
50
31
6
6
1
0 0
filter
YES
8
16
50
27
6
6
1
Controller
NO
15
15
114
38
14
14
6
0
VSTOL
NO
30
15
114
37
14
14
4
0
aircraft
YES
6
6
114
46
14
14
9
2
YES
12
6
114
44
14
14
5
1
Digital to
NO
132
132
354
167
74
76
3
0
Analog
NO
132
264
354
171
74
77
2
0
Converter
YES
5
5
398
189
74
76
29
18
YES
10
5
398
178
74
76
28
8
Motorola
NO
134
132
217
121
133
133
67
0
C-133
NO
268
268
217
128
133
133
67
0
filter
YES
1
1
217
217
133
133
133
133
YES
2
1
217
129
133
133
67
67
Long
NO
2566
2566
1082
1056
1024
1027
5
0
Echo
NO
5132
2566
1082
1061
1024
1027
4
0
Canceler
YES
1088
1088
1107
1064
1024
1028
6
0
YES
2176
1088
1107
1059
1024
1027
5
0
Table 4.1: Application of the design-for-debugging step to a set of standard benchmarks for estimation of hardware overhead (according to the first definition of a cut).
86
Design Description
Structure
Complete Cut - Def.2
Hyper
Control
Critical
Vari-
Regi-
optim.
steps
path
ables
sters
States
Vari-
Regi-
Ports
ables
sters
added 0
8th Order
NO
18
18
35
19
8
8
1
Continued
NO
36
18
35
19
8
8
1
0
Fraction
YES
4
4
49
30
8
8
6
2
IIR Filter
YES
8
4
49
29
8
10
5
1
Linear GE
NO
12
12
48
19
5
8
3
0
Controller 1
NO
24
12
48
23
5
13
1
0
YES
6
6
48
27
5
5
5
1 0
YES
12
6
48
26
5
8
4
Wavelet
NO
16
16
31
20
15
1
1
0
Filter
NO
32
16
31
20
15
1
1
0
YES
1
1
31
31
15
1
1
1
YES
2
1
31
31
15
1
1
0
Modem
NO
10
10
33
16
8
4
1
0
Filter
NO
20
10
33
15
8
4
1
0
YES
4
4
47
29
8
8
6
2
YES
8
4
47
27
8
8
4
1
Volterra
NO
12
12
28
15
4
4
1
0
2nd
NO
12
24
28
15
4
4
1
0
order
YES
6
6
28
19
4
4
1
0
filter
YES
6
12
28
17
4
4
1
0
Volterra
NO
20
20
50
22
6
6
1
0
3rd order
NO
20
40
50
22
6
6
1
0
nonlinear
YES
8
8
50
31
6
6
1
0 0
filter
YES
8
16
50
27
6
6
1
Controller
NO
15
15
114
38
14
11
3
0
VSTOL
NO
30
15
114
37
14
11
2
0
aircraft
YES
6
6
114
46
14
14
9
2
YES
12
6
114
44
14
14
5
1
Digital to
NO
132
132
354
167
74
2
1
0
Analog
NO
132
264
354
171
74
2
1
0
Converter
YES
5
5
398
189
74
2
1
0
YES
10
5
398
178
74
2
1
0
Motorola
NO
134
132
217
121
133
1
1
0
C-133
NO
268
268
217
128
133
1
1
0
filter
YES
1
1
217
217
133
1
1
0
YES
2
1
217
129
133
1
1
0
Long
NO
2566
2566
1082
1056
1024
2
1
0
Echo
NO
5132
2566
1082
1061
1024
2
1
0
Canceler
YES
1088
1088
1107
1064
1024
2
1
0
YES
2176
1088
1107
1059
1024
2
1
0
Table 4.2: Application of the design-for-debugging step to a set of standard benchmarks for estimation of hardware overhead (according to the second definition of a cut).
87
CHAPTER 5 Cut-based Functional Debugging for Programmable Systems-on-Chip Due to the growth of both design complexity and the number of gates per pin, functional debugging has emerged as a critical step in the development of a system-onchip (SOC). Traditional approaches, such as system emulation and simulation, are becoming increasingly inadequate to address the system debugging needs. Design simulation is two to ten orders of magnitude slower than emulation and, thus, is used primarily for short, focused test sequences. Emulation has the required speed, but imposes strict limitations on signal observability and controllability. We introduce a new debugging approach for programmable SOCs that leverages the complementary advantages of emulation and simulation. We propose a set of tools, transparent to both the design and debugging process, that enables the user to run long test sequences in emulation, and upon error detection, roll-back to an arbitrary instance in execution time and switch over to simulation-based debugging for full design visibility and controllability. The efficacy of the developed approach is dependent upon the method for transferring the computation from one execution domain to another. Although the approach can be applied to any computational model, we have developed a suite of optimization techniques that enable computation transfer in a mixed SDFSISRAM computation model. This computation model is frequently used in many
88
communications and multimedia SOCs. The effectiveness of the developed debugging methodology has been demonstrated on a set of multi-core designs where combined emulation-simulation has been enabled with low hardware and performance overhead.
5.1
Introduction
As the complexity of designs increases, verification emerges as a dominant step with respect to time and cost in the development of a system-on-chip (SOC). For example, the UltraSPARC-I design team reported that debugging efforts took twice as long as their design activities [Yan95]. The difficulty of verifying designs is likely to worsen in the future. The Intel development strategy team foresees that a major design concern for year-2006 microprocessor designs will be the need to exhaustively test all possible compatibility combinations [Yu96]. The same team also states that the circuitry in their future designs devoted to debugging purposes is estimated to increase sharply to 6% from the current 3% of the total die area. The two most important components for efficient functional and timing verification are speed of functional execution and design controllability and observability. Traditional approaches, such as design emulation and simulation, are becoming increasingly inefficient to address system debugging needs. Design emulation - implemented on arrays of rapid prototyping modules (FPGAs) or specialized hardware - is fast, but due to strict pin limitations provides limited and cumbersome design controllability and observability. Simulation - a software model of the design at an arbitrary level of accuracy - has the required controllability and observability, but is, depending on the modeling accuracy, two to ten orders of magnitude slower than emulation [Ziv96]. For example, the functional verifi-
89
cation team for the new HP PA8000 processor reported 8 orders of magnitude difference in speed between the RT-level simulated (0.5 Hz on a workstation) and FPGA-emulated (300KHz) functional execution of their PA8000-based 200MHz workstation system [Man97]. To combine the strengths of both verification domains, in previous chapter we introduced a cut-based functional verification method that enables the verifier to seamlessly migrate the execution back and forth between design simulation and emulation. Long test sequences are run in emulation. Upon error detection, the computation is migrated to the simulation tool for full design visibility and controllability. The functional execution is switched from one domain to another by transferring the complete cut of the computation. A complete cut of a computation is a set of variables that fully determines the design state at an arbitrary time instance. The running design (simulation or emulation) periodically outputs its cuts. The cuts are saved by a monitoring workstation. When a transition to the alternate domain is desired, any one of the previously saved cuts can be used to initialize, and then continue execution with preserved functional and timing accuracy. However, this debugging technique, is restricted only to single-core, statically scheduled ASIC designs. Since current trends in the semiconductor industry show that programmable SOCs are becoming the dominant design paradigm, providing adequate verification tools for such systems is a premier engineering task. We have developed a generalized cut-based methodology for coordinated simulation and emulation of SOCs consisting of a system of programmable and application-specific cores. The methodology introduces a number of optimization problems and a need for efficient implementation mechanisms. We provide a set of tools that solve these problems for a mixed SDF-SISRAM model of computation. This computation
90
model is frequently used in many communications, multimedia, and DSP applications. We propose a suite of algorithms that effectively identifies the minimal computation state (cut) and postprocesses the system components to enable I/O of cut variables. The experiments, conducted on a set of standard multi-core benchmarks and industry-strength designs, quantify the overhead induced to enable the developed debugging paradigm. In all cases, no or negligible hardware and performance overhead was incurred while providing both fast functional execution and full design controllability and observability.
5.1.1
Motivational Example
In this subsection, using a simple multicore application-specific system, we provide an overview of the design-for-debugging techniques used to enable cut-based functional debugging. The goal of design-for-debugging is to add minimal hardware resources into the component cores, such that during idle system bus cycles, both cores can output or input their states in the shortest possible time. The optimum solution to this problem often requires interleaving of the transfers of minimal states of component cores. In order to formalize the debugging approach, we introduce the generic definition of a cut (minimal computation state) and its application to the synchronous data flow computation model. A complete cut at time T is generically defined as a subset of variables from which any other variable computed after T can be computed. We briefly review the two alternative definitions of a cut of an SDF computation presented in the previous chapter. The First Definition of a Complete Cut. A complete cut is a set of variables generated within one computation iteration that bisects all possible paths in the computation. The Second Definition of a Complete Cut. A complete cut is a subset of variables
91
that bisects all cyclic paths in the control data flow graph of a computation. IN
D1
D2
D3
D4
D5
11 C8 * 12 0 1 A5
+
+
C19
*
*
C17
D5 +
D4 +
D3 +
D2 +
D1 + *
C8
C10
A2
A4
A6
A8
A10
*
IN
+ A4 R7
C12
+ A6
* R7
+ A8 R7 * + A10 R8 R9 R10 C16 4 R7 R11 R12 * C18 5 R1 A9 + * C19 R1 6 A7 + * C17 7 R1 A5 + * C15 R12 8 R1 R11 A3 + * C13 R10 9 R1 A1 + * C11 R9 10 R8 C14
3
*
*
C15
*
*
C13
C18
C16
*
*
C11
*
* C9
*
C14
C12
C10
R7
2
A1
R10 R11 R12
R9
+ A2
C9 *
A9
A7
+
+
+
OUT A3
R8
R7
OUT D1
D2
*
D3
D4
D5
Figure 5.1: The 5th order CF IIR filter. Motivational example: system component core scheduled, allocated, and assigned CDFGs. These two definitions enable exploration of trade-offs in the cut-selection process. The first definition of a cut imposes a limitation that all contained variables must be selected from a single computation iteration. The second definition relaxes this requirement by enabling the search for a cut to be conducted among variables in several consecutive computation iterations. While cuts, which obey the first definition, require smaller trace capturing devices and induce lower computation initiation start-up times, the cuts formed according to the second definition frequently require less hardware resources. In the remainder of this subsection, we demonstrate how a cut can be used to restart a computation and
92
what are the involved trade-offs in cut selection of multi-core designs. IN +
*
*
R7
OUT C1
+
.. .. .
C8 . . . . . . . . C19
C7
R2 R3
R4 +
R1
IN
R5
R8 R9 R10 R11 R12
R6
OUT
To the chip I/O port and the programmable core
Figure 5.2: Motivational example: ASIC architectures for the 5th order CF IIR and 3rd order Gray-Markel ladder filter. Consider a SOC consisting of a programmable core and two applicationspecific cores, a 5th-order CF IIR filter [Cro75] and a 3rd-order Gray-Markel ladder IIR filter [Gra73]. The programmable core places the inputs for both filters on a shared system bus. The filters periodically read the data, process it, and place the results of the computation back onto the shared bus. A single iteration of a computation process is finished when the programmable core reads the output data from the bus. The computations running on the ASICs are statically scheduled. The corresponding control data flow graphs (CDFGs), operation scheduling, and register allocation are shown in Figure 5.1. A single iteration of computation on each core requires 13 control steps. The Gray-Markel ladder filter inputs and outputs data in control steps 0 and 12 respectively, while the CF IIR filter inputs and outputs data in control steps 11 and 10, respectively. The architecture is shown in Figure 5.2.
93
For example, consider two different cuts of the Gray-Markel ladder filter: (a) variables D1, D2, and D3 and (b) variables at the outputs of adders A1, A3, and A5, all stored in register R1. Both variable subsets bisect all cyclic paths in the CDFG - by deleting the edges in the CDFG that represent these variables, all cyclic paths are removed. The computation at iteration (i) can be restarted by injecting the values of the cut (a) in the first three control steps of the iteration (i) respectively and applying the appropriate input sequence In(i). In order to enable such data injection, registers R2, R3, and R4 have to be connected to the
D2
R3 * C6
OUT
R5 + R5
12
11
10
9
8
7
6
5
R2
D1
R6
A10 +
R4
R6
D3
A11 +
R6
C5 *
A6 + R3 R5
C4
*
+ A4 R2
A12
R1
+ A5
C3 *
+ A8 * C2
+ A3 R1 C7 *
3
4
* C1
R1 + A2
R1
2
1
R5
0
12
IN
R2
11
10
9
8
R1
R3
+ A7
R2 + A1
R4
R6 + A10
7
6
iteration(i+1)
R1
R4 R5
R5 A11 +
R6
A12 +
C5 *
A6 + R3 R5
* C4
A4 + R2
R6
5
4
+ A9
OUT * C6
R3
+ A9 R1 *
+ A5
C3
+ A8 R1
* C2
+ A3 C7 *
3
2
1
R1
R3
+ A7 + A2
R1
* C1
+ A1 R5
0
12
IN
R2
11
10
9
8
7
6
R1
R2
A11 + R4
R6 A10 +
R6
5
4
iteration(i)
R1
R4
R3 * C6 R5 + A12
R5
C5 *
A6 + R3 R5
C4
*
A4 + R2 C7 *
3
R6
R1
R1
+ A5 + A8 R1
* C2
+ A3 + A7 * C1
R1
R1
+ A2
2
1
R1
R3
D2
R2
D1
+ A1
IN
R5
0
C3 *
D3
R4
+ A9
OUT
system I/O pins.
iteration(i+2)
Figure 5.3: Motivational example: Unfolded CDFG over three consecutive iterations shows how variables D1, D2, and D3 are computed. In order to use the cut (b) to restart the computation at iteration (i+3), cut values from three consecutive iterations (i), (i+1), and (i+2) are required before the machine state is correctly restored. Figure 5.3 illustrates how variables D1(i+ 2), D2(i+2), and D3(i+2), and therefore the machine state at control step 0, are restored using the system primary inputs and cut variables (outputs from adders A1, A3, and A5) from the specified consecutive iterations. We demonstrate how the computation in that case will be executed correctly using an unfolded CDFG in Figure 5.3. The bold lines in the unfolded CDFG represent all the intermediate variables required for computation of D1, D2, and D3. The design is initialized (prior to control step 0 of iteration (i)) with arbitrary variable values. To output
94
cut (b), its variables have to be output through the I/O port at control steps 1, 4, and 7 respectively in three consecutive iterations. Note that to enable I/O of the cut (b) only one register R1 has to be connected to the I/O port. In order to facilitate efficient SOC synthesis, the system integrator has to be provided with a certain level of cut scheduling programmability. The goal is to find for each core a subset of registers of minimal cardinality such that all the variables held by these registers constitute at least one complete cut. The additional constraint for register selection is that the variables of both cuts can be output and input through the available set of system pins. According to the register assignment in Figure 5.1, one register in both cores is sufficient to provide a large set of complete cuts. In the case of the Gray-Markel filter, complete cuts are enabled by connecting register R1 to the output port. The cuts are formed by dispensing the output of either A1 or C1, A3 or C2, and A5 or C3 at control steps 1 or 2, 4 or 5, and 7 or 8, respectively. These three pairs of variables, which exclusively have to be output, define 8 different cuts. Similarly, by connecting the register R7 to the I/O pins 32 complete cuts are enabled for the CF IIR filter. The variables computed at operations either A2 or A3, A4 or A5, A6 or A7, A8 or A9, and A10 or C12 can be output at control steps 0 or 9, 1 or 8, 2 or 7, 3 or 6, and 4 or 5 respectively. Note that if the cut output schedule for the CF IIR filter is fixed at control steps 5, 6, 7, 8, and 9, either a buffer needs to be allocated to one of the filters to hold the unscheduled variables, or the cuts have to be dispensed sequentially.
5.2
Preliminaries
The architecture template used to evaluate the developed debugging method is depicted in Figure 5.4. The architecture is typical for most modern consumer
95
electronics, multimedia, and telecommunications devices. It consists of a master programmable core (MPC) and a set of application-specific (ASIC) and slave programmable cores (SPC), all connected to a shared bus. As shown in Figure 5.4, an example ASIC could be a datapath unit with registers. Alternatively, an ASIC could be a background memory. Programmable core [IP block] IN X(N) D(N) = X(N) + D(N-1)A1 + D(N-2)A2 OUT D(N) if DEBUG D(N) = IN Y(N) = D(N)B0 + D(N-1)B1 + D(N-2)B2 OUT Y(N)
Master programmable core signal to start cut-set output
Shared bus
ASIC core 1
Debug Buffer
ALU
ASIC core 2
Scheduling Logic
Register File
Figure 5.4: The targeted system: individual core architecture, embedded software, and core intercommunication. Two main, often contradictory, criteria for evaluation of system and behavioral synthesis models of computations are expressiveness [Edw97] and suitability for optimization. While high expressiveness implies wider application domain, suitability for optimization often implies efficient implementation. For this target system the following heterogeneous model of computation is assumed. The backbone of the model is the semi-infinite stream random access machine (SISRAM)
96
model. The standard RAM model [Aho83] is relaxed by removing a requirement for algorithm termination. The SISRAM model provides high flexibility with well tested and widely used semantics and syntax (C and Java). The second component of the heterogeneous model is synchronous data flow (SDF) [Lee87]. This component facilitates optimization-intensive compilation for both programmable and ASIC platforms. Using this heterogeneous model of computation, we simultaneously address the needs of both control and data intensive applications. The programmable cores use a mixture of the two models, while the ASICs use solely the SDF model. We have used C as the specification language for programmable cores and Silage [Gen90] for the ASICs. The intermediate representation used for algorithms is the control data flow graph. The cut-based debugging approach is not limited to a specific computation model. For each computation model, though, a cut definition has to be established to satisfy the generic concept of a cut. In this manuscript we describe cut selection for ASICs synthesized using the SDF computation model. This simplification is assumed because of three reasons: brevity, availability of synthesis tools, and the fact that the SDF computation model corresponds to many data-intensive applications.
5.3
Debugging and Real-Time Cut Export
This section describes the key steps of the design-for-debugging approach. Figure 5.5 illustrates the technical details of the process of cut I/O. The instrumentation of the code, which runs on the MPC, starts the cut export process. For example, as shown in Figure 5.5, it first sends a signal (start ASIC) to the ASICs that starts the cut output sequence of all ASICs. The process of their cut I/O is statically scheduled. Cuts of cores ASIC1 (Cut1) and ASIC3 (Cut2) are interleaved and
97
the cut of core ASIC4 (Cut3) is output after the cuts of cores ASIC1 and ASIC3. Since the cut I/O of the ASICs is statically scheduled, the MPC knows when the ASIC cut export is complete. At that point, the MPC polls (start SPC(i)) the cut I/O control of each SPCs in the system. Upon receipt of this signal the virtual tristate gate that controls the actual I/O of cut variables onto the shared bus is enabled. The instrumented code running on the SPC has to be able to assure that exactly one cut I/O (Cut SPC) is completed. Once its cut is dispensed, the SPC sends a signal back to the MPC that acknowledges one successful cut I/O process. At that point the MPC initiates its own cut I/O, which represents the end of the cut I/O process. SPC
Legend
MPC
ASIC 1
ASIC 3
ASIC 4
IN
MPC instruction that initiates the cut-set I/O
Computation iteration with its primary input and output PC computation with primary inputs and outputs
Cut 2
Cut 1
Cut 1 start PC
IN
IN
A
B OUT
OUT
Cut 3
IN
A
IN OUT
Cut SPC
Computation in time
Part of the computation at which the core transfers its cut
IN
start ASC
B
Cut MPC
OUT
Figure 5.5: View at the process of outputting the cut variables of all cores in the system.
98
5.3.1
ASIC Design-for-Debugging
During the design of an application-specific core, debug functionality is added as a postprocessing step. This functionality includes a set of register-to-output interconnects that enables the export of a number of different cuts and a hardware feature that enables the system integrator to select a specific cut. The motivation of enabling multiple cuts for I/O is as follows: (i) the designer often does not know the system integration constraints during the design of an individual component and does not want to modify the design upon request at integration time and (ii) the system integrator can benefit from multiple cuts if hard-to-solve scheduling instances are encountered. The I/O of variables of a particular computation cut is enabled by explicit connection of registers that store these variables to the I/O ports of the ASIC (if these registers are not already connected). On the other hand, note that one subset of registers may be used for I/O of a number of different cuts. This property of the register selection for cut I/O can be used to enable selection of the particular cut to be output at integration time. The goal is to achieve more flexible cut I/O in the cases when multiple cores are outputting their cuts on the same bus or the developed core is used as a subblock in a larger core. The integrator controls the selection of a particular cut using, for example, different control microcode.
5.3.2
Code Compilation-for-Debugging
In general, each programmable core has two components in its cut: instructionaccessible states (e.g. general-purpose registers), and states non-accessible using machine code (e.g. branch prediction hardware, caches, and pipeline latches). The part of the cut accessible to instructions is transferred using debug instruc-
99
tions that are instrumented into the original code. The portion of the cut that is not accessible by instructions can be exported in several ways. Many state-of-theart processors provide built-in debug-ports that control breakpoint, pipeline and general purpose registers, and memory access logic [Yu96, Kel97]. Alternately, flushing of caches, pipelines, and/or branch predictors can be used as means of state invalidation. However, this approach may decrease the performance of the debugging system. Such deficiency may be unacceptable for debugging real-time systems. An alternate approach is to shadow the invisible states in such a way that at the moment of cut I/O the programmable core continues processing using one copy, while the shadowed copy is transferred to the monitoring workstation. Upon transfer, the shadowed copy is updated with the latest changes. The debug instructions can be added either before or after compilation. While the pre-compilation choice requires in-source-code embedding of cut I/O instructions (for example, used in the MIPS Pixie [Smi91]), the postprocessing step encompasses object code instrumentation similar to that implemented in Purify (Purify uses this technique to locate memory access errors) [Has92]. The precompilation approach has a significant advantage due to independence with respect to the hardware platform. Debugging platforms that can provide real-time JTAG support for such an approach have been already developed and marketed [Mau86]. An example of instrumented code is given in Figure 5.4 where the programmable core executes a 2nd order direct-form IIR filter. The instruction OUT(D[N]); is used for observability and if Debug D[N] := IN; for controllability. Controllability in emulation is beneficial because of the following application. The designer can change easily the state of the computation in simulation. Next, as a part of the debugging process, using emulation controllability, the designer can transfer the
100
computation state from simulation to emulation, and hence, restart the computation in emulation with an arbitrary computation state as a starting point. The instrumentation process is performed in four phases. In the first phase, the minimal-size cut for each statically scheduled user-defined SDF computation island running on each programmable core is identified. An SDF computation island in a SISRAM computation is a set of instructions that process input data following the SDF computation model. The SDF computation islands are triggered by interrupts and executed pseudo-periodically according to the input data rate. In the second phase, the code is augmented with debug instructions that perform cut I/O. In the third phase, we identify the cut variables outside the SDF islands. Finally, we instrument the code with instructions that initiate and call the function performing the system state I/O. Usually, the call to this function is placed in the main loop of the program. Using profiling tools, the designer determines the best location and frequency of calling of this function. The fourth phase of instrumentization for debugging is currently not automated.
5.3.3
Integration-for-Debugging
The ASIC developer provides the system integrator with information about the set of cuts that can be enabled. For each ASIC, the variables and control steps at which they can be dispensed through the virtual pins of the ASIC are given. The system integrator faces three design problems. First, for each ASIC a single cut has to be selected. Second, the selected cuts, jointly with the primary inputs and outputs, are scheduled for I/O over the available set of pins. We integrated these two phases into a tight optimization loop, which searches for a feasible scheduling. Finally, if no scheduling is found, the ASIC cuts are transferred sequentially in such a way that no two scheduled cut variables are displayed on the system bus
101
in the same control step.
5.4
Design-for-Debugging: Algorithms
In this section, the optimization problems related to design-for-debugging are identified, their computation complexity is established, and efficient algorithms are developed. There are three main optimization problems related to the support for debugging: (i) finding a minimal cut of a program being executed on a PC, (ii) finding a set of register-to-output interconnects that enables a large number of non-overlapping cuts, and (iii) selection of a cut for each ASIC such that the I/O of all cuts is interleaved and conducted in minimal number of control cycles.
5.4.1
Code Instrumentation for Cut I/O
The export and import of cut variables of computations executed on programmable cores is performed by executing debug instructions embedded into the original code by a compilation postprocessing tool. The embedded instructions impose an overhead on the program performance and storage. The code size overhead is directly proportional to the cardinality of the selected cuts. The number of embedded instructions also impacts the number of required cycles and therefore, the time overhead to output the programmable core cut. These two problems are not identical since different code segments can be executed with different dynamic frequencies in the SISRAM computation model. Since in Section 6 the timing overhead is shown to be minimal on a variety of applications, we focus our attention on the first optimization problem. Solutions for both optimization problems are strongly positively correlated in the sense that a good solution to one of the problems implies high likelihood for a good solution to the other.
102
PROBLEM: PC Cut Selection. INSTANCE: Given a computation presented as directed cyclic graph and an integer M . QUESTION: I s there a subset of edges that correspond to node outputs Oi , i = 1..N such that when deleted, leaves no directed cycles in the graph, and that N < M ? The PC Cut Selection is an NP-complete problem since there is one-to-one mapping between the special case of this problem, when all operations in the computation are executed exactly the same number of times, and the problem of finding the minimal Feedback Arc Set [Gar79]. To address this problem, we have developed a heuristic summarized using pseudo-code in Figure 5.6 The heuristic initially partitions the graph into a set of strongly connected components (SCCs) using the breadth-first search algorithm [Cor90]. This algorithm has complexity O(V + E), where V is the number of vertices and E is the number of edges in a graph. All trivial SCCs, which contain exactly one vertex, are deleted from the resulting set since they do not form cycles. The algorithm then iteratively performs several processing steps on each of the non-trivial SCCs. At the beginning of each iteration, to reduce the solution search space, a graph compaction step is performed. In this step, each path P : A → B, which contains only vertices V ∈ P, V = A with exactly one variable input, is replaced with a new edge EA,B that connects the source A and destination B and represents an arbitrarilyy selected edge of the same path. In the next step, an objective function decides which node (variable) in the current set of SCCs is to be deleted. The function analyzes, for the deletion of each vertex, the cardinality and the vertex cardinalities of the new set of SCCs. The vertex that results in the smallest objective function is deleted from the set
103
of nodes as well as all adjacent edges. The deleted vertex is added to the resulting cut. The process of graph compaction, candidate node deletion evaluation, node deletion, and graph updating is repeated while the set of non-trivial SCCs in the graph is not empty. The set of nodes deleted from the computation represents the final cut selection. Create a set SCC = Scc(CDF G(V, E)) of SCCs [Cor90] For each SCCi ∈ SCC If (|SCCi | = 1) Delete SCCi from SCC Repeat LOOP S times CUT = null While SCC = empty For each SCCi ∈ SCC GraphCompaction(SCCi ) For each node Vi,j Compute scc = Scc(SCCi − Vi,j ) |scc| OF (scc) = (1 + α) i=1 (|scci | · N Edges(scci )), 1 where α is random number α ∈ {0, |SCC| 2}
Find Vi,j with minimal OF (scc(SCCi , Vi,j )). Delete Vi,j from SCCi . SCC = scc(SCCi , Vi,j ) For each SCCi ∈ SCC If (|SCCi | = 1) Delete SCCi from SCC. CU T = CU T ∪ Ei,j If (|CU T | < |BEST CU T |) BEST CU T = CU T Return BESTCUT
Figure 5.6: Pseudo-code for PC CUT SELECTION search.
5.4.2
Cut Selection and Register-to-Port Interconnection
The goal of the ASIC design-for-debugging process is to assign a minimal number of register-to-output interconnects such that large number of complete cuts can be output from the core. An additional constraint is set on the timing occurrence
104
of these cuts. Since the core developer does not know in advance the multi-core system configuration, i.e. the future scheduling constraints of the cut variables, its search for a set of register-to-output interconnects is targeted for a large number of time non-overlapping small cuts. Such a subset of registers gives the system integrator more flexibility to find a scheduling solution for interleaved cut I/O. We introduce a heuristic search objective and algorithm for such subset of registers. Heuristic definition 1. A “debugging prospective register subset” is a subset of registers Ri , i = 1..RM that defines a set of M distinct cuts Ci , i = 1..M that satisfies the following two statements: RM < M ax and OF (Ci , i = 1..M ) = M
(
M i=1
|Ci |)·(
i=1
∀V ∈Ci
Lif eT ime2 (V )
∀ControlStepCS
LiveV ariables2 (CS))
< K where function LiveV ariables(CS)
returns the number of variables alive at control step CS, K is a given real number, and M, M ax, RM are given integers. The definition of a debugging prospective register subset forces selection of registers that define a set of cuts with large cardinality, small cardinality of containing cuts, long life-times of containing cut variables, and non-overlapping lifetimes of variables in the set of cuts. These three properties of the selected subset of registers provide flexibility at various decision levels for multi-core cut scheduling. The core developer faces an optimization problem to find the debugging register subset with the smallest possible constants M ax and K where constant M ax has priority over K. PROBLEM: Debugging-Prospective Register Selection. INSTANCE: Given scheduled and assigned CDFG, real number M inOF , and integer M inCard. QUESTION: I s there a subset of registers Ri , i = 1..RM that determines a set of M distinct CDFG cuts Ci , i = 1..M , and has OF (Ci , i = 1..M ) < M in
105
and RM < M inCard? A special case of the Debugging Register Selection problem, with no register sharing among CDFG computation variables and no additional heuristic requirements, is equivalent to the problem of finding the minimal Feedback Arc Set [Gar79]. Therefore, Debugging-Prospective Register Selection is computationally intractable problem. We have developed a heuristic to search for a debugging prospective register subset in a scheduled and assigned CDFG. The algorithm is formally explained using the pseudo-code in Figure 5.7. Firstly, the algorithm partitions the CDFG into a set SCC of SCCs. Then for each register Ri , the heuristic, using the objective function described bellow, evaluates the set of strongly connected subgraphs sccj ∈ scc that are result of deletion of all variables held by register Ri . The objective function used to quantify the register selection importance is: OF R(R, SR, CDF G) =
LiveV ariables2 (CS,SR) ∀ControlStepCS 2
SCCcar(R,CDF G)·
∀V ∈R
Lif eT ime (V )
Function SCCcar(R, CDF G) returns the sum of squares of cardinalities of strongly connected components scci when all variables held by register R are deleted from the CDFG. LiveV ariables2 (CS, SR) returns for control step CS a sum of squares of number of variables alive at CS and held by the currently selected subset of registers SR. The register with the highest objective function is selected, added to the currently selected subset of registers SR, and all its variables (edges) are deleted from the original CDFG. The process of register selection is recursively repeated while the set of nontrivial SCCs is not empty. In order to provide probabilistic register selection, and therefore improve the search engine, in our implementation we multiply the original objective function value with a random number within a prespecified offset. The search for the best register selection is iterated S times (in our experiments S was equal to the
106
Create a set SCC = Scc(CDF G(V, E)) of SSC [Cor90]. For each SCCi ∈ SCC If (|SCCi | = 1) Delete SCCi from SCC Repeat LOOP S times Starting set of registers SR = null While SCC = empty For each register Ri SR Compute scc = Scc(CDF G − Ei,j |Ei,j ∈ Ri ) Select the register Rk which results in minimum OF R(Rk , SR, CDF G) and delete all variables held by Rk from SCCi SR = SR ∪ Rk SCC = scc(SCCi , Ei,j |Ei,j ∈ Rk ) For each SCCi ∈ SCC If (|SCCi | = 1) Delete SCCi from SCC If (OF (SR) > OF (BEST SR)) BEST SR = SR Return BESTSR
Figure 5.7: Pseudo-code for the debugging prospective register subset search. number of operations in the CDF G).
5.4.3
Cut Scheduling of Multiple Cores
In this subsection, we introduce an algorithmic solution that enables efficient I/O of cut variables from multiple statically scheduled ASICs. This design step is done during system integration. The optimization problem has three modular subtasks. In the first subtask, at least one cut is selected for each ASIC. The second phase encompasses a search within the common multiple (CM) of periods of all ASICs in the system for a subset of cuts that can be scheduled in the fewest successive control steps. The third task encompasses scheduling of the system
107
cut. As an input to the algorithm, the system integrator is provided a table where each row represents a list of variables that constitute an ASIC cut and a range of control steps when each variable can be read or written. In order to reduce the size of the table, we treat a list of consecutively alive variables in a single cut that are stored in the same register and that are dependent only upon the variable previously stored in that register (except the first variable in the list) as a single “compacted” variable. An example of such a cut table, which corresponds to the CDFG of the third order Gray-Markel ladder filter presented in Figure 5.1a, is depicted in Table 5.1. Since the output variables of operations A1 and C1 fulfill the above requirements and are stored in the same register R1, we represent these variables using a single variable (denoted as “compacted” in the remainder of the text). Similarly, output variables of operations (A3 and C2) and (A5 and C3) are compacted as two distinct variables with life-times that span over the life-times of the original variables. Variables and their life-times A1-C1 (1,2)
A3-C2 (4,5)
A5-C3 (7,8)
Table 5.1: Table of cuts for the 3rd order Gray-Markel ladder filter. In general, each ASIC can have a different period of its on-going computation. Therefore, we use the common multiple (CM) of all, as the system ASIC debugging period. Within this period the algorithm tries to find a feasible schedule of variables of all ASIC cuts such that the range of control steps is minimal between the moments when the first and last cut variable in the ASIC subsystem is output. The problem is formally stated in the following way: PROBLEM: Cut Selection and Scheduling.
108
INSTANCE: Given a set of cores ASIC, a set of cuts CU TCore for each core Core ∈ ASIC, a set of variables V for each cut C ∈ CU TCore , a set of control steps CSv for which each variable v ∈ V is alive, a set of CS control steps at which chip ports are idle, and integer M axRange. f QUESTION: I s there a selection f of a cut CU TCore for each core, squch that f exists a distinct control steps CSvf ∈ CS at which for each variable v ∈ CU TCore f f , w ∈ CU TCore2 are scheduled the chip port is idle, no two variables v ∈ CU TCore1
for transfer CSvf = CSwf through the chip port at the same idle control step, and that the max(CSvf − CSwf ) < M axRange ? The problem of Scheduling a Subset of Variables in a CDFG, discussed in the previous chapter, is a special case of the Cut Selection and Scheduling problem. Since the former problem is NP-complete, the Cut Selection and Scheduling problem is also computationally intractable. We developed a most-constrained least-constraining heuristic in order to provide a competitive solution to this problem. The heuristic is explained using the pseudo-code in Figure 5.8. The developed algorithm addresses simultaneously the cut selection and scheduling by integrating heuristics for their solution in a tight search loop. Initially, for each ASIC, the available cuts, all of equal cardinality, are sorted in decreasing order with respect to the average life-time of contained variables. The selection and scheduling search loop starts by selecting one cut for each ASIC from its list of available cuts. In order to provide search randomization and at the same time give priority to cuts with lower indexes, i.e. cuts that contain variables with longer average life-times, the probability of selecting a particular cut is made proportional to the square of its average variable life-time. Once cuts for all ASICs are selected, the next step is to find, within CM consecutive control steps, the smallest subset of M consecutive control steps in
109
Cut Selection Preprocessing: For each ASICi Create a list Li of cuts CSi,j in decreasing order of average life-time of contained variables. Cut Selection: Repeat LOOP S times For each ASICi Select cut CSi,j ∈ Li where j is an index selected among all other indices with probability proportional to the square of average life-time of variables in CSi,j |ASIC| Range = i=1 |CSi,j |. Repeat Find the set T IM E of Range idle consecutive control steps in the CM of periods of all ASICs that contain the cuts of all CSi,j . For each T IM Ep ∈ T IM E For each subset of |ASIC| distinct cuts of each ASIC encompassed with T IM Ep Schedule CSi,j in T IM Ep . If schedule found and in shorter time than the best schedule then best = current schedule. Range = Range + 1. until Range > BestRangeorRange == |LCM |. Cut Scheduling: [Schedule CSi,j in T IM Ep ] Repeat until all variables scheduled For each control step Ci Compute its constraint Ci .constraint as sum of 1 V.lif etime
for each variable V alive at Ci .
For each variable Vi Compute its constraint Vi .constraint as sum of constraints of control steps at which Vi is alive. Select N most-constrained tasks and exactly schedule them at control steps with the smallest sum of constraints.
Figure 5.8: Pseudo-code for the cut selection and scheduling algorithm.
110
which the variables of the selected cuts can be scheduled. The search is initiated by determining the lower bound on the range of control steps M = Mmin for which all cuts can be dispensed. This bound is equal to the sum of the cardinalities of all ASIC cuts. Next, within the frame of CM consecutive control steps, a set T IM E is found where each element T IM Ep ∈ T IM E represents a particular subset of Np consecutive control steps that contains at least Mmin idle control steps and for each variable of all ASIC cuts there must be at least one of these idle steps in which it is alive. One frame of control steps T IM Ep may “contain” more than one of the available cuts for one ASIC. Therefore, for each combination of cuts within T IM Ep a scheduling heuristic is performed. The scheduling heuristic iteratively constructs the solution by selecting N most-constrained cut variables and scheduling them exactly at the N leastconstraining control steps. The functions used to quantify the constraint are given in the pseudo-code in Figure 5.8. If feasible scheduling is found, the range of the solution Np is compared to the best current solution and if more competitive is memorized as the best. If feasible scheduling is not found, the control step range M is increased and the search procedure is repeated until for a given set of cuts scheduling is found or the current range exceeds the scheduling range of the current best solution. The best scheduling solution is the one which in minimal number of consecutive control steps inputs or outputs the variables of each ASIC in the system.
5.5
Experimental Results
We have conducted a set of experiments to evaluate the effectiveness of our system debugging paradigm. Table 5.2 shows the set of application-specific cores that were used, and the area overhead introduced by the design-for-debugging
111
postprocessing step that introduces hardware to enable cut I/O. The applicationspecific cores include a number of Avenhaus IIR filters, several linear controllers, a non-linear volterra filter, and a modem. The designs were synthesized using HYPER [Rab91]. In the first column, the name of the core is presented followed in the next six columns with core architecture data, such as the length of one iteration in control steps, number of variables in the application-specific computation, core area dedicated to the execution units, registers, multiplexers, and total area. In column eight, the number of variables in the smallest cut is presented. The first number in the sum is the number of control steps at which functional I/O is performed while the second number is the number of control steps at which the cut variables are I/O. Finally, the last two columns show the final total area and the percentage area overhead. For all designs, the cut selection algorithm succeeded to find competitive solutions in less than five seconds on a Sun UltraSPARC-II. Architecture
Debug information
Area
(mm2 )
cut
Area overhead
ASIC
Period
Variables
ALU
Reg
Mux
Total
cardinality
(mm2 )
%
Cascade
14
51
2.79
0.73
0.36
3.88
2+4
0
0
Continued Fraction
19
53
4.28
1.09
0.31
5.69
2+8
0
0
Direct Form II
10
53
10.65
1.45
0.57
12.68
2+1
0
0
Parallel
10
57
3.56
0.70
0.41
4.67
2+4
0
0
Parallel
9
57
4.70
0.71
0.38
5.79
2+4
0
0
Modem
20
50
1.83
0.71
0.25
2.79
2+1
0.01
0.3
Lin3
10
86
13.11
1.51
0.78
15.40
5+0
0
0
Lin3
15
86
6.63
1.48
0.58
8.70
6+0
0
0
Mat
15
29
2.21
0.39
0.04
2.65
4+0
0
0
Ellip
15
50
4.42
0.81
0.22
5.46
5+0
0
0
Volterra
15
40
1.41
0.35
0.13
1.88
2+1
0.06
3
Table 5.2: Debug information for the implementation of a number of ASICs.
112
The application-specific cores were integrated into a number of system configurations to test the efficiency of our cut-selection and scheduling technique. The results are illustrated in Table 5.3. While in the first column the core mix is specified, the second column presents the common system period. The last two columns present the number of variables of all cuts in the system that have to be transferred in one system period and the range of control steps in which this transfer is accomplished. By comparing the last two columns it is self-evident that we efficiently utilize the idle control steps in order to transfer cut variables. For all design integrations, the cut selection and scheduling algorithm succeeded to find competitive solutions in less than one second on a Sun UltraSPARC-II. Application-specific
System
Number of system
Control steps required
core mix
period
output variables
to output the system cut
Cascade, Modem, Direct Form II, Volterra
20
15
18
Continued Fraction, Lin3 (T=15)
19
16
17
Parallel (T=10), Mat, Ellip
15
15
15
Lin3 (T=10), Lin3 (T=15), Continued Fraction
22
21
22
Cascade, Direct Form II, Parallel (T=10), Mat, Ellip
31
24
27
Continued Fraction, Mat, Direct Form II, Modem
21
20
20
Parallel (T=10), Modem, Direct Form II, Cascade
19
18
18
Parallel (T=9), Ellip, Lin3 (T=10), Volterra
21
19
20
Parallel (T=9), Lin3 (T=10), Lin3 (T=15), Volterra
25
20
23
All cores in Table 5.2
61
57
59
Table 5.3: The efficiency of the cut selection and scheduling approach is tested on a set of ASIC core-mixes. Finally, in order to evaluate the feasibility and overhead of embedding instructions for programmable core cut I/O, we instrumented the code from the MediaBench benchmark suite [Lee97] with instructions that dispense the program cut out of a general-purpose processor. The experimental results are presented using two subtables in Table 5.4. The first column of each subtable presents the
113
name of the multimedia application. The next two columns quantify the total number of variables in the program and the cardinality of the program cut respectively. Finally, the last column presents the ratio of cut versus total variables. For all programs, the cut selection algorithm succeeded to find competitive solutions in less than a minute on a Sun UltraSPARC-II. Importantly, as a proof of concept, we outline the fact that on the average only 10.8% of all variables in the targeted benchmark suite were found to constitute cuts, and therefore, present a direct transfer cost. Application
Variables
cut
% of Vars
cardinality
for I/O
Application
Variables
cut
% of Vars
cardinality
for I/O
ADPCM.enc
22
6
27%
PEGWIT
101
14
14%
ADPCM.dec
13
5
38%
PGP
970
33
3.4%
D/A Converter
213
3
1.4%
GSM.enc.dec
140
12
8.6%
G721.enc.dec
28
2
7%
JPEG.enc
513
17
3%
epic/unepic
298
32
11%
MPEG2.dec
432
24
5.5%
Table 5.4: Total number of variables and cut cardinalities for a set of multimedia benchmarks.
5.6
Conclusion
This paper has introduced the first approach to functional verification of statically or dynamically scheduled programmable SOCs that coordinates design emulation and simulation. We have established the complexity of all optimization tasks, and developed efficient heuristics to provide competitive solutions. The effectiveness of the new approach and accompanying algorithms has been demonstrated on a set of programmable and application-specific multi-core designs where full system observability and controllability have been enabled with low hardware and performance overhead.
114
CHAPTER 6 Symbolic Debugging of Optimized Behavioral Specifications for Fast Variable Recovery Symbolic debuggers are system development tools that can accelerate the validation speed of behavioral specifications by allowing a user to interact with an executing code at the source level. In response to a user query, the debugger must be able to retrieve and display the value of a source variable in a manner consistent with what the user expects with respect to the source statement where execution has halted. However, when a behavioral specification has been optimized using transformations, values of variables may either be inaccessible in the run-time state or inconsistent with what the user expects. We address the problem that pertains to the retrieval of source values for the globally optimized behavioral specifications. We describe how transformations affect the retrieval of source values. We present an approach for a symbolic debugger to retrieve and display the value of a variable correctly and efficiently in response to a user inquiry about the variable in the source specification. The implementation of the new debugging approach poses several optimization tasks. We formulate the optimization tasks and develop heuristics to solve them. We demonstrated the effectiveness of the proposed approach on a set of designs.
115
6.1
Introduction
Functional debugging of hardware and software systems has been recognized as a labor-intensive and expensive process. This situation is likely to become even worse in the future, since the key technological trends indicate that the percentage of controllable and observable variables in designs will steadily decrease. For example, the designers of a modern superscalar microprocessor reported that the debugging process took more than 40% of the development time [Uch94]. Symbolic debuggers are system development tools that can accelerate the validation speed of behavioral specifications by allowing a user to interact with an executing code at the source level. Symbolic debugging must ensure that in response to a user inquiry, the debugger is able to retrieve and display the value of a source variable in a manner consistent with what the user expects with respect to a breakpoint of the source code. The application of code optimization techniques usually makes symbolic debugging harder. While code optimization techniques such as transformations must have the property that the optimized code is functionally equivalent to the unoptimized code, such optimization techniques may produce a different execution sequence from the source statements and alter the intermediate results. In addition, some variables in the source code may disappear in the optimized code. Debugging the unoptimized code rather than the optimized code is not acceptable for several reasons. First, it may be the case that while an error in the unoptimized code is undetectable, the error is detectable in the optimized code. Second, optimizations may be absolutely necessary to execute a program. The code without optimizations for debugging may be unable to run on a target platform, for example, because of memory limitations or constraints imposed on an embedded system. Third, a symbolic debugger for optimized code is a means
to find errors in the optimizer. In this chapter we address the problem pertaining to the retrieval of source values for globally optimized behavioral specifications. We present a design-for-debugging approach for a symbolic debugger to retrieve and display the value of a variable correctly and efficiently in response to a user inquiry about the variable in the source specification.

We informally define the design-for-debugging problem in the following way. We are given a design or code. The code is fully specified in any high-level design specification language and is transformed into the control-data flow graph (CDFG) of the computation. The goal of our design-for-debugging (DfD) technique is to modify the original code so that every variable of the source code is debuggable (that is, controllable and observable) in the optimized program as fast as possible. At the same time, the original code must be optimized with respect to target design metrics such as throughput, latency, and power consumption. A particularly important requirement is that, in response to a user inquiry about a variable in the source program, the value of the variable should be retrieved or set as fast as possible.

We define an important concept for developing a method that solves the problem. The golden cut is defined to be the set of variables in the source code which should be correct [Hen82] in the optimized program. The variables are time-dependent: a variable named x at two different locations in the source program is treated as two different variables. By default, primary inputs and state or delay variables are included in the golden cut. The complete golden cut is a golden cut with the property that all variables which appear after the cut can be computed using only the variables in the cut, excluding primary inputs and state variables. An empty golden cut is a golden cut with no variables except for the default primary inputs and state variables.
Our proposed method can be described as follows. First, we determine a golden cut. Next, in response to a user inquiry about a source variable xt at some point t in the source program, all the variables in the golden cut that the variable xt depends on are determined by a breadth-first search of the source CDFG with reversed arcs. For those variables, except the primary inputs and state variables in the golden cut, all the statements that they depend on are identified by a breadth-first search of the optimized CDFG with reversed arcs. Those statements in the optimized CDFG are executed on the multi-core system-on-silicon under debugging. From this execution, we obtain the values of the variables in the golden cut that the variable xt depends on. Using these values, the variable xt is computed by the statements in the source CDFG on a workstation (usually a uniprocessor) which runs the debugger program.

Our proposed method requires that the golden cut be chosen to result in minimum debugging time, optimal design metrics, and as complete debugging of the optimized program as possible. The last requirement stems from the fact that our method executes part of the source program to get the value of a requested source variable. Because our goal is to debug the optimized program, this portion of the source program should be minimal.
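For illustration, the following minimal Python sketch (our own rendition, not the original toolset) captures the recovery step under a simplified representation: the source CDFG is a dictionary mapping each computed variable to a pair (operator, operand names), and `known` holds the values of primary inputs, state variables, and golden-cut variables that were observed on the design under debugging.

```python
from collections import deque

def cut_variables_needed(cdfg, known, query):
    """Reverse BFS over the source CDFG: which known (golden-cut, input, or
    state) values does `query` depend on?"""
    needed, seen, frontier = set(), set(), deque([query])
    while frontier:
        v = frontier.popleft()
        if v in seen:
            continue
        seen.add(v)
        if v in known:
            needed.add(v)
        else:
            frontier.extend(cdfg[v][1])      # operands of the producing statement
    return needed

def recover_variable(cdfg, known, query):
    """Re-execute the necessary source statements on the workstation."""
    values = {v: known[v] for v in cut_variables_needed(cdfg, known, query)}
    def evaluate(v):
        if v not in values:
            op, operands = cdfg[v]
            values[v] = op(*(evaluate(u) for u in operands))
        return values[v]
    return evaluate(query)

# Example (hypothetical toy computation): x = (a + b) * c
import operator
cdfg = {"t1": (operator.add, ["a", "b"]), "x": (operator.mul, ["t1", "c"])}
print(recover_variable(cdfg, {"a": 1, "b": 2, "c": 4}, "x"))   # -> 12
```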
6.1.1 Motivational Example
We illustrate the proposed method with a small motivational example shown in Figures 6.1, 6.2, and 6.3. The design objective is throughput optimization. The source program is shown in Figure 6.1. The source program consists of additions and multiplications with constants. The number of clock cycles for an iteration is 9. The number in italics next to each edge (a variable) denotes the number of operations that need to be executed on a general-purpose computer
for retrieving the value of the variable. If there is no number by an edge, the value of the variable is available, because the variable is either an input (state or primary input) or output (state or primary output) variable.

Figure 6.1: The source program of the motivational example, with Golden Cut A = {x, y} and Golden Cut B marked.

The original program can be optimized to execute in 5 clock cycles. Part of the optimized program (only the portion computing the state variable D2) is shown in Figure 6.2. Almost all variables in the source program disappear in the optimized program. For example, the variables x and y in the source program have disappeared in the optimized program. It takes 3.575 operations on average on a workstation to retrieve any intermediate variable in the source program, with the assumption that the values of all state variables for the current iteration are known. In addition to the high debugging time, debugging is performed entirely on the source program rather than its optimized version. Our proposed DfD method produces an optimized program which can execute in 6 clock cycles, while ensuring faster and more complete debugging of the optimized program. Part of this optimized program is shown in Figure 6.3. The golden cut chosen for our method is shown in Figure 6.1, labeled as Golden Cut A.
Figure 6.2: Part of the optimized program without considering debugging (only the portion computing the state variable D2).

It takes 1.125 operations on average on a workstation to retrieve any intermediate variable in the source program. If we choose the golden cut labeled as Golden Cut B in Figure 6.1, it takes 1 operation on average on a workstation to retrieve any intermediate variable in the source program, while the optimized program executes in 8 clock cycles on the system-on-silicon. This example shows that debugging of the optimized program can be performed efficiently and thoroughly, with minimal loss of optimization potential, by the proposed DfD method.
6.2 Computational and Hardware Model
We represent a computation by a hierarchical control data flow graph (CDFG) consisting of nodes representing data operators or sub-graphs, and edges representing the data, control, and timing precedences [Rab91]. The computations operate on periodic semi-infinite streams of inputs to produce semi-infinite streams of outputs. The underlying computational model is homogeneous synchronous
data flow model [Lee87], which is widely used in computationally intensive applications such as image and video processing, multimedia, speech and audio processing, control, and communications.

Figure 6.3: Part of the optimized program produced by the proposed design-for-debugging method.

We do not impose any restriction on the interconnect scheme of the assumed hardware model at the RT level. Registers may or may not be grouped in register files. Each hardware resource can be connected in an arbitrary way to any other hardware resource. The initial design is augmented with additional hardware which enables controllability in the "debugging" mode. The following input operation is incorporated to provide complete controllability of a variable Var1 using a user-specified input variable Input1: if (Debug) then Var1 = Input1.

The problem of setting breakpoints is handled in the following way. A breakpoint can be set on any variable such that the execution of the program must stop immediately after performing the operation producing the breakpoint variable. Since the optimized code, rather than the source code, usually runs on multiple processors, the problem of determining when to stop the execution of the
optimized code for a breakpoint set in the source code is not straightforward. If the variable set as a breakpoint exists in the optimized code, the execution of the optimized code stops immediately after the control step which produces the variable. If not, we stop the execution of the optimized code immediately after the control step producing any variable which exists in both the source and optimized codes and depends on the breakpoint variable. If any one of the variables depending on the breakpoint variable is computed, then the breakpoint variable has already been computed.
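A short sketch of this breakpoint-mapping rule, using hypothetical data-structure names (the schedule is a list of the variable sets produced in each control step of the optimized code):

```python
def stop_step(bp_var, schedule, common_vars, depends_on_bp):
    """Control step after which execution of the optimized code should halt.

    bp_var: breakpoint variable from the source code;
    common_vars: variables present in both the source and optimized codes;
    depends_on_bp: source variables that depend on bp_var.
    """
    for step, produced in enumerate(schedule):
        if bp_var in common_vars:
            if bp_var in produced:               # breakpoint variable survived optimization
                return step
        elif any(v in common_vars and v in depends_on_bp for v in produced):
            return step                          # a surviving dependent implies bp_var is computed
    return None                                  # breakpoint never reached in this iteration
```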
6.3 Design for Symbolic Debugging
In response to a user inquiry about a source variable x in the source CDFG, we first need to determine if the variable x exists in the optimized CDFG. This step can be efficiently performed by keeping a list of variables that exist in both the source and optimized CDFGs. If the variable x exists in the optimized CDFG, we need to confirm if the value of the variable x is still stored in a register. Due to register sharing, the register holding the variable x may be storing a different variable at the time of the inquiry. This can be handled by checking the schedule of variables for registers. At the time of the inquiry, only the variables stored in the registers are available. If any one of the answers is negative, then the variable needs to be computed from the golden cut.
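The availability check itself is simple; the following sketch (an assumed data layout, not the dissertation's code) makes it concrete: a value is directly readable only if the variable survived optimization and its register has not yet been reused.

```python
def value_available(var, current_step, optimized_vars, register_schedule):
    """register_schedule: var -> (register, birth_step, death_step) in the optimized code."""
    if var not in optimized_vars:
        return False                              # variable disappeared during optimization
    _, birth, death = register_schedule[var]
    return birth <= current_step <= death         # register still holds this variable
```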
6.3.1 Selection of Optimal Golden Cuts
Our proposed method requires that the golden cut be chosen to result in minimum debugging time, optimal design metrics, and as complete debugging of the optimized program as possible. The last requirement stems from the fact that
our method executes part of the source program to get the value of a requested source variable. Because our goal is to debug the optimized program, this part of the source program should be minimal.

Several conflicting requirements on a golden cut can be identified. First, a golden cut should be as small as possible in order to minimize the disruption of the optimization potential of optimization techniques. Second, a golden cut should not be too small, in order to minimize the debugging time. For example, an empty golden cut is the smallest golden cut and will minimize the disruption of the optimization potential, but it will result in an optimized code with long debugging time. Finally, a golden cut should be as large as possible to ensure the complete debugging of the optimized code. This requirement is satisfied by the golden cut containing all the variables in the source CDFG, which allows no optimization potential to be realized. Therefore, a golden cut should be chosen by balancing all these conflicting requirements.

We consider the problem of finding the smallest complete golden cut such that every source variable can be computed by at most k operations starting from the golden cut. More formally, the problem can be defined as follows:

Problem: Given a directed acyclic hypergraph H(V, E), find the smallest subset of edges E′ ⊆ E such that for every edge e ∈ E, the cone c of e with respect to E′ has at most k nodes, where the cone of e with respect to E′ is the subset of nodes consisting of the nodes on paths from the edges in E′ to e.

The source program can be described by a directed acyclic hypergraph due to the requirement that a complete golden cut be chosen within one iteration of the computation. Note that both the source and optimized programs in the motivational example are described by directed acyclic hypergraphs. The pseudocode of the basic heuristic for the golden cut problem is provided in Figure 6.4. Intuitively, the heuristic inserts "pipeline stages" in the hypergraph
H so that the number of edges with pipeline registers is minimized and the size of the cone for each edge is less than or equal to k. The pipeline stages are inserted in sequence; once a stage is inserted, it stays fixed. Let |cE′(e)| denote the size of the cone of the edge e with respect to E′. When calculating |cE′(e)|, we need to traverse the graph once for each edge, so O(|V||E|) steps are required for each pipeline stage insertion.

A minimum cut for the subgraph containing only green edges and their incident nodes can be computed optimally in polynomial time by a maximum-flow algorithm, based on the max-flow min-cut theorem [Cor90]. Using the method proposed by Yang and Wong [Yan94], the flow network for the subgraph is constructed as follows:

• For each hyperedge n = (v; v1, ..., vl) in the subgraph, add two nodes n1 and n2 and connect them with an edge (n1, n2). For each node u incident on the hyperedge n, add two edges (u, n1) and (n2, u). Assign unit capacity to the edge (n1, n2) and infinite capacity to all other added edges (see Figure 6.5).

• A "dummy" source node s and a "dummy" sink node t are added to the subgraph. From the source node, we add edges with infinite capacity to all the source nodes of the original subgraph. We also add edges with infinite capacity from all the sink nodes of the original subgraph to the sink.

The construction process for an example graph is shown in Figure 6.6. A minimum cut of the constructed flow network can be found using various approaches, such as the O(|V||E|)-time algorithm in [Yan94]. We use linear programming, relying on the public-domain package lp_solve [LP], to solve the maximum-flow problem on the constructed network. All the "saturated" edges in the constructed flow network are added to the golden cut. To avoid trivial solutions, we use a lower bound l. The constant l is experimentally determined so that high-quality golden cuts are obtained.
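For concreteness, the sketch below (our own illustration assuming the networkx package, not the implementation used in the experiments) builds the flow network for a set of "green" hyperedges with the node-splitting construction above and extracts the cut hyperedges from a minimum s-t cut. It assumes that the labels "s" and "t" are not original node names and that every source-to-sink path passes through at least one hyperedge; edges added without a capacity attribute are treated as infinite-capacity by networkx flow routines.

```python
import networkx as nx

def min_cut_hyperedges(hyperedges, sources, sinks):
    """Indices of hyperedges on a minimum s-t cut; each hyperedge is (driver, [fanout])."""
    G = nx.DiGraph()
    for i, (v, fanout) in enumerate(hyperedges):
        n1, n2 = ("bridge", i, 1), ("bridge", i, 2)
        G.add_edge(n1, n2, capacity=1)            # unit capacity on the bridge edge
        for u in [v] + list(fanout):              # infinite capacity on all attachments
            G.add_edge(u, n1)
            G.add_edge(n2, u)
    for v in sources:                              # dummy source and sink
        G.add_edge("s", v)
    for v in sinks:
        G.add_edge(v, "t")
    _, (S, T) = nx.minimum_cut(G, "s", "t")
    # A hyperedge is "saturated" when its bridge edge crosses the partition.
    return [i for i in range(len(hyperedges))
            if ("bridge", i, 1) in S and ("bridge", i, 2) in T]
```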
Given: a directed acyclic hypergraph H(V, E) and constants l and k
E′ = ∅
Repeat
    Calculate |cE′(e)| for all edges after the most recently inserted pipeline stage.
    If all |cE′(e)| ≤ k, break.
    Mark as "green" the edges with l ≤ |cE′(e)| ≤ k.
    Construct a flow network for the subgraph with only green edges and their incident nodes.
    Find a minimum cut of the flow network using a maximum-flow algorithm.
    E′ ← E′ ∪ {edges of the new cut}
Return E′
Figure 6.4: The pseudocode of the basic heuristic for the golden cut problem.

Of course, the previous insertions of pipeline stages will affect the quality of the subsequent insertions. Therefore, to further improve the heuristic, we employ iterative improvement, using a slightly modified version of the heuristic of Figure 6.4 as a search engine. The heuristic is modified such that the constant l is not fixed; its value is chosen randomly between 1 and k for each pipeline stage insertion. Let Pipeline(H, k) be the modified heuristic for the hypergraph H with a constant k, and let |E′| be the number of edges in the golden cut E′. The iterative improvement heuristic based on Pipeline(H, k) is described in Figure 6.7.

Figure 6.5: Modeling a hyperedge in the flow network.
Figure 6.6: The construction process of a flow network for the "green" subgraph.

Given: a directed acyclic hypergraph H(V, E) and constant k
Minimum Cut = ∞
Repeat
    E′ = Pipeline(H, k)
    If |E′| < Minimum Cut
        Minimum Cut = |E′|
        Golden Cut = E′
Until no improvement in c consecutive iterations
Return Golden Cut
Figure 6.7: The pseudocode of the iterative improvement heuristic for the golden cut problem.
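A compact Python rendering of Figure 6.7 follows; it is a hedged sketch in which the basic heuristic is passed in as a callable, since Figure 6.4 is not reproduced here, and basic_heuristic(H, k, rng) is assumed to redraw the lower bound l from [1, k] before every pipeline stage insertion.

```python
import random

def iterative_golden_cut(H, k, basic_heuristic, c=20, seed=0):
    """Random-restart wrapper around the basic golden-cut heuristic (Figure 6.7)."""
    rng = random.Random(seed)
    best_cut, since_improved = None, 0
    while since_improved < c:
        cut = basic_heuristic(H, k, rng)
        if best_cut is None or len(cut) < len(best_cut):
            best_cut, since_improved = cut, 0     # smaller cut found: reset the counter
        else:
            since_improved += 1
    return best_cut
```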
6.4 Experimental Results
We applied our approach to design for symbolic debugging on a set of 10 small industrial examples as well as two large design examples. The smaller designs include a set of Avenhaus, Volterra, and IIR filters, an audio D/A converter, and an LMS audio formatter. Table 6.1 presents the experimental results for the small
designs. We define query time as the expected time to retrieve any variable in the source program. The time is measured as the average number of operations that need to be executed to retrieve the value of a variable. Table 6.1 is obtained under the constraint that the value k for the linear program is set such that the final query time is 50%, 25%, or 12.5% of the initial query time. The average golden cut size with respect to the number of variables was 4.99%, 10.49%, and 19.26%, respectively.

The two large designs include the JPEG codec from the Independent JPEG Group and the European GSM 06.10 provisional standard for full-rate speech transcoding, prI-ETS 300036, which uses residual pulse excitation/long-term prediction coding at 13 kbit/s. Table 6.2 presents the experimental results for the large designs. For the same set of query time constraints, the average golden cut size with respect to the number of variables was 2.83%, 6.07%, and 12.72%, respectively. None of the examples resulted in run-times of the linear program solver larger than a minute.
6.5 Conclusion
We addressed the problem related to the retrieval of source values for the globally optimized behavioral specifications. We explained how transformations affect the retrieval of source values. We presented an approach for a symbolic debugger to retrieve and display the value of a variable correctly and efficiently in response to a user inquiry about the variable in the source specification. The implementation of the new debugging approach posed several optimization tasks. We formulated the optimization tasks and developed efficient algorithms to solve them. The effectiveness of the proposed approach was demonstrated on a set of designs.
Design               Variables in CDFG   G. Cut Size 1   G. Cut Size 2   G. Cut Size 3
12th order IIR                      56               3               5               9
Avenhaus direct                     40               2               5               9
Avenhaus cascade                    34               2               4               8
Avenhaus parallel                   39               2               5               9
Avenhaus continued                  35               2               5               9
Avenhaus ladder                     50               3               6              11
DAC                                354               7              15              28
2nd order Volterra                  29               2               4               7
3rd order Volterra                  50               3               5               9
LMS formatter                      464               9              21              45

Table 6.1: Golden Cut Sizes 1, 2, and 3 are obtained for values of k in the linear program such that the final query time is 0.5, 0.25, and 0.125 of the initial query time, respectively.
Design         Variables in CDFG   G. Cut Size 1   G. Cut Size 2   G. Cut Size 3
JPEG encoder                4806             120             234             501
JPEG decoder                4269             105             229             453
GSM encoder                 3291              98             206             417
GSM decoder                 2556              87             199             439

Table 6.2: Golden Cut Sizes 1, 2, and 3 are obtained for values of k in the linear program such that the final query time is 0.5, 0.25, and 0.125 of the initial query time, respectively.
CHAPTER 7

Non-Intrusive Symbolic Debugging of Optimized Behavioral Specifications

Symbolic debuggers are system development tools that can accelerate the validation speed of behavioral specifications by allowing a user to interact with an executing code at the source level. In response to a user query, the debugger must be able to retrieve and display the value of a source variable in a manner consistent with what the user expects with respect to the source statement where execution has halted. However, when a behavioral specification has been optimized using transformations, values of variables may either be inaccessible in the run-time state or inconsistent with what the user expects. In this chapter, we address the problem that pertains to the retrieval of source values for globally optimized behavioral specifications. We have developed a set of techniques that, given a behavioral specification CDFG, enforce computation of a selected subset Vcut of user variables such that (i) all other variables v ∈ CDFG can be computed from Vcut and (ii) this enforcement has minimal impact on the optimization potential of the computation. The implementation of the new debugging approach poses several optimization tasks. We have formulated the optimization tasks and developed heuristics to solve them. The effectiveness of the proposed approach has been demonstrated on a set of benchmark designs.
7.1 Introduction
Functional debugging of hardware and software systems has emerged as a dominant step with respect to the time and cost of the development process. For example, debugging (architecture and functional verification) of the UltraSPARC-I took twice as long as its design [Yan95]. The difficulty of verifying designs is likely to worsen in the future, since key technological trends indicate that the percentage of controllable and observable variables in designs will steadily decrease. The Intel development strategy team foresees that a major design concern for their year-2006 microprocessor will be the need to exhaustively test all possible computational and compatibility combinations [Yu96].

Symbolic debuggers are system development tools that can accelerate the validation speed of behavioral specifications by allowing a user to interact with an executing code at the source level [Hen82]. Symbolic debugging must ensure that, in response to a user inquiry, the debugger is able to retrieve and display the value of a source variable in a manner consistent with what the user expects with respect to a breakpoint in the source code. The application of code optimization techniques usually makes symbolic debugging harder. While code optimization techniques such as transformations must have the property that the optimized code is functionally equivalent to the unoptimized code, they may produce a different execution sequence from the source statements and alter the intermediate results. Debugging unoptimized rather than optimized code is not acceptable for several reasons:

• an error that is undetectable in the unoptimized code may be detectable in the optimized code,

• optimizations may be necessary to execute a program due to memory limitations or other constraints imposed on an embedded system, and

• a symbolic debugger for optimized code is often the only tool for finding errors in an optimization tool.

In this chapter, we propose a design-for-debugging (DfD) approach that enables retrieval of source values for a globally optimized behavioral specification. The goal of the DfD technique is to modify the original code in a pre-synthesis step such that every variable of the source code is controllable and observable in the optimized program. More formally, given a source behavioral specification (represented as a control data flow graph [Rab91]) CDFG, the goal of the DfD approach is to enforce computation of a selected subset Vcut ⊆ CDFG (the cut) of user variables such that:

• all other variables v ∈ CDFG can be computed from the cut Vcut (therefore Vcut represents a cut of the computation, as presented in Chapters 4 and 5), and

• the enforcement of computation of the user-defined Vcut variables has minimal impact on the optimization potential of the computation.

The original code can then be optimized with respect to target design metrics such as throughput, area, and power consumption. It is important to stress that finding a cut of a computation has been addressed in many debugging [Kir99] and software checkpointing [Ziv98] research works. However, symbolic debugging imposes a new constraint on the cut selection procedure: variables enforced to be computed should not qualitatively restrict the optimization process. The developed DfD technique analyzes the source computation and selects the cut variables according to a number of heuristic policies.
Each policy quantifies the likelihood that a particular variable is not computed due to a specific transformation of the computation [Hon97].

In order to support fully modular pre-processing, explicit computation of the selected cut variables Vcut is enforced by assigning each variable vi ∈ Vcut to a primary output. Thus, application of any synthesis tool results in an optimized behavioral specification CDFGo which necessarily contains the selected cut variables Vcut ⊆ CDFGo. At debugging time (simulation or emulation), the symbolic debugger monitors the values of the cut variables. In response to a user inquiry about a source variable vi that exists in the source CDFG but neither in Vcut nor in the optimized CDFGo, all the variables in the cut that vi depends on are determined by a breadth-first search of the source CDFG with reversed arcs. Using these values, variable vi is computed using the statements from the original CDFG.

The developed symbolic debugging technique poses a number of optimization tasks. In this chapter we define these tasks, establish their complexity, and propose heuristic techniques for their solution. The effectiveness of the developed DfD methodology has been tested using several benchmark designs.
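A minimal sketch of the modular enforcement step (our own illustration, reusing the dictionary-style specification assumed earlier; the dbg_ port names are hypothetical) shows that adding one output assignment per cut variable is sufficient to force any downstream synthesis tool to keep those values alive:

```python
def enforce_cut_as_outputs(statements, outputs, v_cut):
    """Augment a behavioral specification with one output assignment per cut variable.

    statements: list of (target, op, operand_names) tuples;
    outputs: original primary-output variable names;
    v_cut: the selected golden-cut variables.
    """
    new_statements = list(statements)
    new_outputs = set(outputs)
    for v in v_cut:
        port = f"dbg_{v}"                       # hypothetical debug output port name
        new_statements.append((port, "copy", [v]))
        new_outputs.add(port)
    return new_statements, new_outputs
```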
7.1.1 Motivational Example
Using the following motivational example, we show that debugging of an optimized behavioral specification can be performed efficiently and thoroughly with minimal loss of optimization potential by the proposed DfD method. For brevity and expressiveness, we have constructed an abstract computation, illustrated in Figure 7.1, which demonstrates the trade-offs involved in selecting cut variables for symbolic debugging. The CDFG of the constructed computation is depicted in Figure 7.1. Expected optimization steps are applied to shaded areas in the figure as follows:
• additions A1 through A4 can be compacted into a tree of additions for critical path reduction (throughput optimization),

• the number of multiplications can be reduced by applying the distributivity rule to multiplications M1 and M2 and addition A5 (area optimization), and

• all operations in the remaining shaded area can be optimized according to the transformations for achieving fast linear computation [Pot92].
Figure 7.1: An example of the trade-offs involved in selection of cut variables such that the optimization potential of the computation is not impacted. The shaded regions mark the candidate optimizations (critical path reduction; distributivity for reduction of the number of multiplications; simplification of linear computations [Pot92]), and the cut variables GC are highlighted.
We define an important concept that enables effective design-for-debugging. A golden cut is defined as a subset of variables in the source code which should be correct [Hen82] in the optimized program. A complete golden cut Vcut is a golden cut with the property that all user variables and primary outputs in the computation can be computed using only the cut variables and the primary inputs. Alternatively, a complete golden cut is a subset of variables which bisects all cyclic paths in the control data flow graph of a computation [Kir99]. Variables of an example complete golden cut GC (the outputs of M3, M5, and A6) are depicted in Figure 7.1. Using the variables in GC and the primary inputs, any other variable in the computation can be computed. For example, consider the input variables of addition A14. Bold lines in Figure 7.1 illustrate the sequence of operations to be executed in order to calculate the results of A11 and C3 solely by using the variables in the cut. Since the selected cut variables are not results of operations that can be involved in the abovementioned optimizations, their selection yields efficient symbolic debugging accompanied by effective design implementation. Conversely, a highly inefficient cut can be constructed using the output variables of M1, M2, A2, A9, C2, and A8. Besides the larger cardinality of the involved subset of variables, this unfortunate selection also disables all possible optimizations, thus resulting in a poor implementation. In general, not all possible optimizations are known in the pre-processing DfD phase. Therefore, in this chapter we propose a cut selection process that is guided by heuristics that determine the likelihood that an operation can be involved in a transformation.
7.2 Computational and Hardware Model
It is important to stress that the developed symbolic debugging approach is not limited to a specific computation model. However, for each computation model, the definition of a complete golden cut has to be established to satisfy the generic concept: a golden cut at time T is defined as a subset of variables from which any other variable computed after T can be computed. For the sake of conceptual simplicity, in this work we target the synchronous data flow (SDF) model of computation [Bha93]. This simplification is assumed for three reasons: brevity, availability of synthesis tools, and the fact that the SDF computation model corresponds to many data-intensive multimedia, communications, and wireless applications. In our experiments, we have used Silage [Rab91] as a specification language for ASIC implementation.

We assume fully deterministic behavior of the hardware and a continuous semi-infinite operation mode (not necessarily periodic). We do not impose any restriction on the interconnect scheme of the assumed hardware model at the register-transfer level. Registers may or may not be grouped into register files. Each hardware resource can be connected in an arbitrary way to other hardware resources. We do not impose any restrictions on the number of pipeline stages of the employed functional units. The design is fully specified, and its functionality and realization are not disturbed by the debugging process, with the exception of enabling the user to write into specific controllable registers.
7.3 Design for Symbolic Debugging
In this section, we present the technical details behind the developed symbolic debugging technique. First, we describe how our approach is incorporated in a
standard debugging engine. Then, we identify the optimization goals, establish the complexity of involved synthesis tasks, and finally, we propose a heuristic for fast and effective selection of complete golden cuts.
7.3.1 Debugging Optimized Behavioral Specifications
The global flow of the debugging process is depicted in Figure 7.2. As a compilation pre-processing step, the developed DfD technique analyzes the original behavioral specification CDFG in order to select a complete golden cut GC, which is optimization-friendly. Upon selection, the DfD procedure augments the original specification with statements that enforce computation of golden cut variables. If the DfD approach is part of an optimizing compiler, this step can be performed by marking variables. An independent modular DfD technique would achieve the same goal by specifying the golden cut variables as output variables. Once computation of the golden cut variables is assured, the modified behavioral specification CDFGm is processed by a synthesis tool. The result of this process is an optimized behavioral specification CDFGo with guaranteed existence of golden cut variables. While monitoring code execution, the symbolic debugger scans for values of golden cut variables and stores them in designated buffers. Since the computation of a single source variable may involve values of golden cut variables from several iterations (see Chapter 5), the depth of each buffer can be larger than one. The expectation is that the cardinality of cut variables should be much smaller than the cardinality of variables in the source CDFG [Kir99]. Therefore, the memory overhead for golden cut maintenance is in general low. While debugging, at a specific breakpoint the user inquires about a source variable vi in the source CDFG. Initially, the symbolic debugger determines if
vi exists in the optimized CDFGo. This step can be efficiently performed by keeping a list of variables that exist in both the source and optimized CDFGs. If the variable does not exist in the optimized code CDFGo, then its value is computed from the golden cut. All the variables in the cut that the variable vi depends on are determined by a breadth-first search of the source CDFG with reversed arcs. Finally, we compute variable vi using the cut values and the statements from the original CDFG.
Figure 7.2: Global flow of the DfD and symbolic debugging process (design-for-debugging: cut selection and specification augmentation, followed by synthesis; symbolic debugging: locating a queried variable in CDFGo or recomputing it from the golden cut GC).
7.3.2 Selection of Optimal Golden Cuts
The effectiveness of the developed symbolic debugging approach depends strongly on the selection of golden cut variables. In this section, we identify the trade-offs involved in golden cut determination under different optimization constraints. Next, we establish the complexity of the cut selection problem and provide an algorithm for its solution. Finally, we discuss how certain transformations can affect the cut selection, resulting in cut invalidation.

Definition of a Complete Golden Cut. A complete cut is a subset of variables which bisects all cyclic paths in the control data flow graph of a computation. The definition of a complete golden cut has been adopted from [Kir99], where cuts are used to transfer minimal computation states from simulation to emulation engines. Such a definition of a cut ensures that any variable in the original specification can be computed from its cut. However, it does not guarantee that the modified specification can be optimized as effectively as the original one. To address this issue, the search for a computation cut has to reflect the trade-offs involved with potential optimizations. The developed DfD approach does not assume that a particular optimization will be performed, but heuristically quantifies the likelihood that a particular variable will disappear during the optimization process. We propose a set of heuristics that identify variables that are likely to be used in generic, area, and throughput optimizations. Low-power constraints can usually be described as a superposition of transformations for area and throughput [Dey99]. The set of criteria for optimization-sensitive cut selection is incorporated into the search process using an objective function Φ(vi, CDFG). This function attempts to quantify for each variable vi the likelihood that vi disappears during the synthesis process.
Φ(vi, CDFG) = α·|GC| + fanout(vi) + testLinear(vi) + test_1CP(vi) + testDistributivity(vi)
              + β·max(overlap(vi, N(vi))) / (LT(vi)·LT(N(vi)))   (the scheduling constraint of vi)
              + testParallelism(vi) + testInputsInCycles(vi)

The components of the objective function return quantifiers that represent the trade-offs involved in decision making for inclusion of a variable in a complete golden cut. Values of the quantifiers are determined experimentally in a learning process or according to the designer's experience and optimization goals. In our experiments, we have used the meta-algorithmics parameter tuning procedure [Kir97b]. Each component corresponds to the following generic optimization objective:

• |GC| - Small cardinality of the golden cut.

• fanout(vi) - Operations with high fanout. If the result vi of an operation Oi is used as an operand in a relatively large number of different operations, then it is hard to apply transformations to the original CDFG such that vi disappears from it. Thus, it is highly desirable to include vi in a complete golden cut.

• testLinear(vi) - Non-linear operations. It has been demonstrated that a collection of linear operations (addition, subtraction, multiplication with a
constant, etc.) can be transformed optimally for a particular design metric [Pot92]. Therefore, operands or results of non-linear operations should be given preference for inclusion in a golden cut.

• test_1CP(vi) - 1-critical path. During transformations for almost all design metrics, the critical path of the computation is frequently severely modified. This reasoning stems from the fact that the critical path usually limits the performance of a circuit. Therefore, the golden cut selection routine should avoid including variables at most 1 operation away from the critical path.

The following optimization indicators have been considered for area minimization:

• testDistributivity(vi) - Enable distributivity. Low priority for cut selection is given to variables that can be involved in applying distributivity among operations. Since distributivity is the key enabler of reducing expensive operations, such as multiplications or divisions, it is of utmost importance not to disable this transformation.

• Disable scheduling expensive operations in the same control step. Special attention is paid to expensive operations that have short and overlapping lifetimes LT(vi) (LT(vi) = ALAP(vi) − AEAP(vi); AEAP - as early as possible - and ALAP - as late as possible). We denote the subset of variables with lifetimes overlapping the lifetime of vi as N(vi). If the overlap overlap(vi, N(vi)) among them is relatively large with respect to the lifetimes of the considered variables, then the cut selection procedure should avoid inclusion of vi in the cut.

• testParallelism(vi) - Enable reduction of parallelism. Transformations such as loop folding and loop merging modify the computation in such
a way that blocks of parallel operations are merged into a single instruction block with a lower degree of parallelism. Obviously, reduction of parallelism can reduce the circuit's area at the expense of increased clock speed.

In our experiments, we have considered only one transformation for maximizing throughput:

• testInputsInCycles(vi) - Number of inputs in cycles. Cycles with a higher number of primary input variables have to be cut carefully, since input operations can commonly be extracted from the loop and processed as a highly pipelined structure. This transformation can significantly increase the throughput of the system.

The problem of finding a complete golden cut that obeys the requirements of all optimization goals can be defined formally using the following standard format.

PROBLEM: The Complete Golden Cut.
INSTANCE: An unscheduled and unassigned control data flow graph CDFG(V, E) with each node vi weighted according to Φ(vi, CDFG), and a real number K.
QUESTION: Is there a subset of variables GC such that removing GC from the CDFG leaves no directed cycles and the sum of weights Σ_{vi ∈ GC} Φ(vi, CDFG) is smaller than K?

The specified problem is NP-complete, since there is a one-to-one mapping between the special case of this problem in which the weights on all nodes are equal and the FEEDBACK ARC SET problem (GT8, pp. 192, [Gar79]). The developed heuristic algorithm for this problem is summarized using the pseudocode in Figure 7.3. The heuristic starts by logically partitioning the graph into a set of strongly connected components (SCCs) using the algorithm
of [Cor90]. This algorithm has complexity O(V + E), where V is the number of vertices and E is the number of edges in the graph. All trivial SCCs, which contain exactly one vertex, are deleted from the resulting set since they do not form cycles. Then, the algorithm iteratively performs several processing steps on each of the non-trivial SCCs.

Create a set SCC = ComputeScc(CDFG(V, E)) of strongly connected components [Cor90]
For each SCCi ∈ SCC
    If |SCCi| = 1, delete SCCi from SCC
CUT = ∅
While SCC ≠ ∅
    For each SCCi ∈ SCC
        GraphCompaction(SCCi)
        For each node vj ∈ SCCi
            S = ComputeScc(SCCi − vj)
            OF(S, vj) = Σ_{i=1..|S|} Φ(vj, CDFG)
        End For
        Select the vertex vj which results in the maximum OF(S, vj)
        Delete vj from SCCi
        SCC = S(vj)
        For each SCCi ∈ SCC
            If |SCCi| = 1, delete SCCi from SCC
        End For
        CUT = CUT ∪ {vj}
    End For
End While
Figure 7.3: Pseudo-code for the developed algorithm for The Complete Golden Cut problem.

At the beginning of each iteration, to reduce the solution search space, a graph
compaction step is performed. In this step, each path P : A → B whose internal vertices (v ∈ P, v ≠ A, B) have exactly one variable input is replaced with a new edge EA,B which connects the source A and the destination B and represents an arbitrarily selected edge (variable) of the same path. Nodes A and B inherit the maximum weight among their current weights and the weights of all the nodes removed from the CDFG due to the compaction along edge EA,B.
Figure 7.4: Performing the steps of a single iteration of the cut-set selection procedure: (a) the control data flow graph of a third-order Gray-Markel ladder filter; (b) an example of graph compaction, where V is the node considered for deletion; (c) an example of edge deletion and creation of new SCCs, where bold edges and nodes represent the remaining SCCs when V is deleted.

In the next step, the algorithm decides which node (variable) in the current set of SCCs is to be deleted. The algorithm makes its decision based on the cardinality of the newly created set of SCCs and the sum of objective functions of the currently selected cut. The vertex that results in the largest overall objective function is removed from the set of nodes as well as all adjacent edges. The deleted vertex is added to the resulting cut-set. The process of graph compaction, evaluation of node deletion, node deletion, and graph updating is repeated until the set of non-trivial SCCs in the graph is empty. The set of nodes (variables) deleted from the computation represents the final cut-set selection.
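The core loop can be sketched in a few lines of Python. The sketch below is our own simplified rendition, assuming the networkx package; it omits the graph compaction step and the full objective function Φ of Figure 7.3, ranking candidates only by how little cyclic structure survives their deletion.

```python
import networkx as nx

def nontrivial_sccs(graph):
    """Strongly connected components that still contain a directed cycle."""
    return [c for c in nx.strongly_connected_components(graph)
            if len(c) > 1 or any(graph.has_edge(v, v) for v in c)]

def complete_golden_cut(cdfg):
    """Greedy cut selection: delete nodes until no directed cycles remain."""
    g = cdfg.copy()
    cut = set()
    while True:
        sccs = nontrivial_sccs(g)
        if not sccs:
            return cut                      # every cyclic path is now bisected by the cut
        for component in sccs:
            sub = g.subgraph(component).copy()
            def remaining_cycle_mass(v):
                pruned = sub.copy()
                pruned.remove_node(v)
                return sum(len(c) for c in nontrivial_sccs(pruned))
            best = min(component, key=remaining_cycle_mass)
            cut.add(best)                   # deleted vertex joins the cut-set
            g.remove_node(best)
```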
Consider the example shown in Figure 7.4. The CDFG of the third-order Gray-Markel ladder IIR filter, shown in Figure 7.4a, has only one non-trivial SCC. The graph compaction step is explained in Figure 7.4b, where vertex B is merged with vertex A and variable W is merged with variable V. Figure 7.4c describes an example of node deletion; the deleted node creates two smaller SCCs.
7.3.3 Discussion of Cut Validity After Applying Transformations
Once the DfD procedure modifies the source code, CDFGm = DfD(CDFG), a synthesis tool ST is applied in order to generate the final optimized specification CDFGo = ST(CDFGm). In general, the synthesis tool should have the freedom to perform arbitrary transformations on the source computation. The question that can be posed is: does there exist a set of transformations ST which translates the source specification CDFGm, with an enforced complete golden cut GC, into a new specification CDFGo in which the enforced cut GC is not a complete cut? This question can be answered from two perspectives.

• Several examples of computation structures of different implementations of the same computational functionality (for example: the Gray-Markel ladder, cascade, parallel, elliptic, and direct-form IIR filters) clearly indicate that, generally, there exist transformations under which a cut enforced in one specification does not satisfy the cut properties in the transformed specification. However, the sophistication of such algorithmic transformations is far from being met by any published synthesis tool. Therefore, it is not expected that the structure of the computation is changed drastically during optimization.
• There exist transformations performed by common compilers (such as loop fusion, splitting, folding, and unfolding), which modify the loop structure of the computation. However, all of these transformations preserve the completeness of a cut selected in the DfD phase.
7.4 Experimental Results
Proper evaluation of the proposed debugging techniques is a complex problem due to the great variety of optimization steps which can be undertaken during design optimization using transformations. In addition, it is well known that the effectiveness of transformations is greatly dependent on the order in which they are applied. The situation is further complicated by the great variety of designs. For example, designs can vary tremendously in their size, the type of operations used, and their cycle structure and general topology. In order to address this concern, we have applied the new technique to more than a hundred designs from the Hyper [Rab91] and Mediabench [Lee97] benchmark suites. We have used the following transformations: associativity, commutativity, distributivity, zero and inverse element laws, retiming, pipelining, loop unfolding and folding, constant propagation, substitution of constant multiplication with shifts and additions, and common subexpression elimination and replication. In addition, we have used several popular scripts for transformation ordering, such as one which guarantees the maximal throughput when applied to linear computations.

On the overwhelming majority of designs, our technique did not incur any cost, regardless of the targeted optimization goal: area, throughput, or power. The detected exceptions are tabulated in Table 7.1.
Design         ICP   OCP   GC   IArea    OArea    Area OH
dist             7     4     4    7.96     8.23     3.5%
chemical         6     3     3   25.56    26.33       3%
5WDF            17     5     5   81.55    85          4%
7IIR            10     4     7   42.58    51.95      22%
avenhaus        11     5     4   49.90    60.38      21%
10IIR           12     5     5   55.42    68.15      23%
11IIR           17     5     5   66.26    75.53      14%
band-pass       20     5     6  172.45   214         24%
noise-shaper    29     6     9  233.97   325.7       39%
modem           25     6    11  238.01   330.3       40%
DAC             58     3     3   42.99    43.09     0.2%

Table 7.1: Comparison of areas of designs optimized with and without the DfD phase. ICP - initial critical path; OCP - critical path after optimization; GC - cardinality of the complete golden cut; IArea - optimized design area without DfD; OArea - optimized design area with DfD; Area OH - the overhead in area incurred due to pre-processing for symbolic debugging.

On these examples, we applied retiming for joint optimization of latency and throughput, and then the maximally fast script for linear computations. The designs augmented with additional debugging constraints were able to produce the best combination of latency and throughput. However, on some of them a notable area overhead was induced due to the added constraints. Closer analysis of these examples indicates that the symbolic debugging constraints induced a need for computation of additional variables used only for debugging purposes: the combination of transformations used drastically changed the structure of the computations, so that the initial selection of the cut resulted in a need for significant additional computation. It can be concluded that, although it is possible to find examples with additional overhead due to enforced
computation of the golden cut, such cases occur rarely, and they are commonly associated with the application of rather complex and sophisticated transformation scripts for optimization of complex objective functions. Such design objectives are rarely required in modern design practice.
7.5 Conclusion
In response to a user query, a symbolic debugger must be able to retrieve and display the value of a source variable in a manner consistent with what the user expects with respect to the source statement where execution has halted. However, when a behavioral specification has been optimized using transformations, values of variables may either be inaccessible in the run-time state or inconsistent with what the user expects. In this chapter, we propose a set of techniques that, given a behavioral specification, enforce computation of a selected subset of user variables such that all other variables can be computed from this subset and this enforcement has minimal impact on the optimization potential of the computation. The implementation of the new debugging approach poses several optimization tasks. We have formulated them and developed heuristics for their effective solution. The effectiveness of the proposed approach has been demonstrated on a set of benchmark designs.
CHAPTER 8

Engineering Change: Methodology and Application to Behavioral and System Synthesis

Due to the unavoidable need for system debugging, performance tuning, and adaptation to new functionalities and standards, the engineering change (EC) methodology has emerged as one of the crucial components in synthesis and debugging of systems-on-chip. Although EC has received a great deal of attention, until now these efforts were mainly ad-hoc and unrelated to the design process. We introduce a novel design methodology which facilitates design-for-EC and post-processing to enable EC with minimal perturbation. Initially, as a synthesis pre-processing step, the original design specification is augmented with additional design constraints which ensure flexibility for future correction. Upon alteration of the initial design, a novel post-processing technique achieves the desired functionality with a near-minimal perturbation of the initially optimized design. The key contribution we introduce is a constraint manipulation technique which enables reduction of an arbitrary EC problem into its corresponding classical synthesis problem. As a result, in both pre- and post-processing for EC, classical synthesis algorithms can be used to enable flexibility and perform the correction process. We demonstrate the developed EC methodology on a set of behavioral and system synthesis tasks.
8.1 Introduction
Due to the increasing complexities of modern systems-on-chip and more segmented design flows [Sha95], engineering change (EC) has recently emerged as a key enabling technology for shortening the time-to-market. The applicability of EC ranges from system debugging, performance tuning, and adaptation to new functionalities and standards to low-power design [Buc97]. The fundamental goal of any set of EC tools is to provide the designer with the ability to easily perform functional or timing changes on the design, while minimally altering its specification throughout all levels of abstraction. In the case of RTL or logic network descriptions, a small change in the specification may result in significant perturbations of the underlying optimized structures (e.g. layout) [Fan97]. These consequences are highly undesirable because, in the case of fabricated circuits, modifications are performed using:

• Mask updating and refabrication.

• Spare logic: the designer stores spare logic in unused portions of the chip. If an error is detected, this logic can be utilized for error correction using a Focused Ion Beam (FIB) [Tho68] apparatus for cutting and implanting new wires on the die. Similar effects can be achieved by allocating memory cells to multiplex a set of wires allocated for EC.

• Electron beam lithography: the FIB apparatus can be combined with electron beam lithography (EBL) to create a complete system for rewiring and implanting logic structures into an already fabricated design [Tho68].

There are two fundamental approaches to EC: design-for-EC, where a certain amount of logic or programmable interconnects with no effect on the functionality
and timing constraints is augmented into the design before compilation; and post-processing, where, knowing the correct functionality of the design, the optimized design is minimally altered such that the error is corrected. While the goal of the first technique is to anticipate which extra hardware might be useful in the case of an alteration, the second one has the difficult task of using a limited amount of resources to update the optimized design with minimal hassle.

Although a number of techniques which address EC have been developed, until now these efforts were mainly ad-hoc and unrelated to the design process. We introduce a new design methodology which facilitates both design-for-EC and post-processing to enable EC with near-minimal perturbation. Initially, as a synthesis pre-processing step, the original design specification is augmented with additional design constraints which ensure flexibility for future alteration. After the optimization algorithm is applied to the modified input, the added constraints impose a set of additional functionalities that the design can also perform. Upon diagnosis of an alteration in the initial design, a novel post-processing technique, which also facilitates constraint manipulation, achieves the desired functionality with a near-minimal perturbation of the optimized design. The key contribution which we introduce is a generalized constraint manipulation technique which enables reduction of an arbitrary EC problem into its corresponding classical synthesis problem. As a result, in both design-for-EC and post-processing, classical synthesis algorithms can be used to enable flexibility and then perform the correction process. That is in opposition to the currently adopted research model for EC problems, which seeks new synthesis solutions.

The problem of EC has initiated research activity mainly in the logic synthesis domain. However, due to the increasing complexity of behavioral specifications and the increasing number of stages in the current golden reference [Gat94] and waterfall
[Sha95] design flow models, designers are commonly faced with modifications which span a number of design stages. In order to provide connectivity for EC through the entire design process, we demonstrate the developed EC methodology on a set of behavioral and system synthesis tasks. It is important to stress that all developed EC techniques can be applied to synthesis problems at all levels of design abstraction (e.g. logic synthesis, layout).
8.1.1 Motivational Example
In this subsection, we demonstrate how constraint manipulation can be used to enable design flexibility for EC as well as to aid in performing the EC process solely on the updated portion of the design using an off-the-shelf synthesis tool. To present the design-for-EC paradigm, we use an example CDFG shown in Figure 8.1(a). The CDFG has been allocated two different hardware set-ups: one with 3 adders and 1 subtracter, and the other one with 2 adders and 2 subtracters. Both hardware set-ups satisfy the requirement of executing all operations of the CDFG within 5 control steps. Possible scheduling solutions are presented for both set-ups in Figures 8.1(b) and 8.1(c). Assuming the error model where an addition is mistaken for a subtraction or vice versa, the two allocation solutions present different resilience to errors. The solution in Figure 8.1(b) cannot be changed to support any error where a subtraction is mistaken for an addition, nor one where subtraction v is an addition in the corrected specification. The allocation presented in Figure 8.1(c) can sustain any single-operation error, as well as the majority of double errors. Only when both subtractions u and v are corrected to additions does no scheduling solution exist. The synthesis solution can also be optimized during synthesis for near-minimal hassle for EC. Consider the two scheduling solutions presented in Figure 8.1(d).
Figure 8.1: Design-for-EC: two resource allocation and scheduling solutions with different resilience to errors. (a) An example CDFG. (b) Schedule for resource allocation A and its resilience to an error. (c) Schedule for resource allocation B and its resilience to single and double errors. (d) Two different schedules reveal different resilience to an error.
Both of them correspond to the allocation in Figure 8.1(c). If addition v is mistaken for a subtraction, then in the left scheduling solution there are three operations, d, g, and v, that have to be rescheduled. However, the right schedule requires only operations g and v to be rescheduled. Therefore, in this case, the goal of the design-for-EC process is to ensure that one addition unit is idle at control step 4.

Figure 8.2: An example engineering change application: performing graph coloring of a corrected specification only on the updated subgraph. (a) An example of a 3-colorable graph. (b) Merger of nodes (constraints). (c) Solution to the corrected graph with compressed constraints.

The advantage of using constraint manipulation in performing EC is demonstrated on graph coloring. This task corresponds to many resource allocation problems. An example graph is presented in Figure 8.2(a). Suppose the designer wants to replace nodes 5 and 6 with nodes 9, 10, and 11, while preserving the coloring of the remaining set of nodes SUB. Note that applying an off-the-shelf coloring algorithm to the corrected specification is not guaranteed to retrieve such a solution. Instead of developing a new algorithm for this problem, we manipulate the constraints of SUB in such a way that all nodes in SUB colored with one color are merged into one node. This node inherits the edges of all included nodes. For example, as illustrated in Figures 8.2(b-c), nodes 4 and 7 are
merged with node 1. The resulting graph in Figure 8.2(c) can be colored with a traditional graph coloring routine, resulting in a correct global coloring of the updated specification in which the nodes in SUB are colored as in the initial coloring. Detailed descriptions of the developed algorithms for both design-for-EC and EC of scheduling and coloring solutions are provided later in this chapter.
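The constraint-compression idea can be sketched directly. The following Python fragment is our own illustration (not the dissertation's implementation); it assumes the networkx package, that the corrected interference graph is given as a networkx.Graph, and that the frozen colors of the preserved subgraph SUB are integers.

```python
import itertools
import networkx as nx

def recolor_with_frozen_subgraph(graph, frozen_coloring):
    """EC-style coloring via constraint compression: SUB keeps its original colors.

    graph: corrected interference graph; frozen_coloring: node -> color for SUB nodes.
    """
    # 1. Compression: one super-node per frozen color; it inherits all incident edges.
    rep = {v: ("frozen", frozen_coloring[v]) if v in frozen_coloring else v
           for v in graph.nodes}
    compressed = nx.Graph()
    compressed.add_nodes_from(set(rep.values()))
    compressed.add_edges_from((rep[u], rep[v]) for u, v in graph.edges if rep[u] != rep[v])

    # 2. Color the compressed graph: super-nodes are pre-colored with their frozen
    #    color, the new nodes are then colored greedily around them.
    color = {n: n[1] for n in compressed if isinstance(n, tuple) and n[0] == "frozen"}
    for n in compressed:
        if n not in color:
            used = {color[m] for m in compressed[n] if m in color}
            color[n] = next(c for c in itertools.count() if c not in used)

    # 3. Expand the coloring back onto the original graph.
    return {v: color[rep[v]] for v in graph.nodes}
```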
8.2 Preliminaries

8.2.1 Hardware and Computational Model
We have selected the synchronous data flow (SDF) model [Lee87] as our computational model. SDF is a special case of data flow in which the number of data samples produced or consumed by each node on each invocation is specified a priori. Nodes can be scheduled statically at compile time onto programmable processors. We restrict our attention to homogeneous SDF (HSDF), where each node consumes and produces exactly one sample on every execution. The HSDF model is well suited for the specification of single-task computations in numerous application domains such as DSP, communications, and multimedia. The syntax of a targeted computation is defined as a hierarchical control-data flow graph (CDFG) [Rab91]. The CDFG represents the computation as a flow graph with nodes, data edges, and control edges. The semantics underlying the syntax of the CDFG format, as we already stated, is that of the SDF model. The HSDF model was selected mainly because of the availability of synthesis tools and, therefore, the ease of collecting experimental data. All developed EC techniques can be applied successfully to other computation models such as the discrete event, communicating FSMs, synchronous/reactive, dataflow process network, and Petri net models [Edw97].
8.2.2 Targeted Behavioral Synthesis Tasks
Behavioral synthesis transforms a given behavioral specification into an RTL description that can implement the given behavior. It encompasses a variety of synthesis tasks, such as scheduling, allocation, binding, partitioning, module selection, and transformations. An overview of existing synthesis techniques can be found in [Gaj92, DeM94]. For the sake of brevity, we demonstrate the developed EC methodology for only two synthesis tasks: operation scheduling and resource allocation and assignment. Allocation determines the type and quantity of resources such as storage units, functional units, and interconnect units used in a data path. Assignment is the process of binding each operation to a functional unit, each variable to a storage unit, and each data transfer to an interconnect unit. Optimization goals may vary for different allocation problems. For example, in register allocation, the synthesis goal is not only to minimize the number of allocated registers but also to minimize the interconnect cost. Many register allocation algorithms for CDFGs that contain no loops focus on either unconditional register sharing [Pau89] or conditional register sharing [Kur87]. For CDFGs with loops, Stok and van den Born [Sto89] proposed a method to break the loops at their boundaries such that variables whose lifetimes cross a loop boundary are split and treated as two separate variables. Scheduling is the process of partitioning the set of arithmetic and logical operations in the CDFG into groups of operations so that operations in the same group can be executed concurrently in one control step, while taking into consideration possible trade-offs between total execution time and hardware cost. In the scheduling step, the total number of control steps needed to execute all operations in the CDFG, the minimum number of functional modules, and the
lifetimes of variables are determined. The lifetime of a variable spans between the control step at which it is computed and the control step at which the last variable dependent on its value is computed. There are two basic approaches to scheduling: heuristics [Pau89] and integer linear programming [Hwa91].
8.3 The New EC Methodology
The complexity of modern application-specific systems has resulted in design flows which consist of a number of stages. The two most widely accepted design flows are the golden model and the waterfall model. The golden model is a copy of the design specification at some level of abstraction (usually RTL) at which most of the changes are performed [Gat94]. The underlying concept behind the waterfall design process is a progression through various levels of abstraction with the intent of fully characterizing each level before moving to the next level [Sha95]. As the complexities of behavioral specifications increase, both design flows are becoming more vulnerable to the EC process due to the demand for updating designs throughout many stages. To address this issue, we have developed a generic EC methodology, applicable to all design stages, which facilitates constraint manipulation to augment the design with flexibility for future changes. The EC is conducted by searching for a correction that induces minimal hassle to the optimized solution. Flexibility for EC is achieved in a synthesis pre-processing step, as shown in Figure 8.3. The initial behavioral design description BD is augmented with additional design constraints (BDa). The additional constraints reflect the demand for flexibility. For example, for register allocation, i.e., graph coloring, in order to impose that two variables which may be stored in the same register are assigned to different ones, an edge has to be added between these variables in a pre-processing
step to graph coloring. The application of the optimization algorithm to BDa provides a solution OptDa that satisfies both the original and the EC-targeted constraints. The additional design constraints can be focused towards a particular type of error or augmented to provide guaranteed flexibility for EC after an arbitrary error is diagnosed. The trade-off between significant design flexibility and a small hardware overhead can be tuned according to the designer's needs.

[Figure 8.3 flow: the behavioral design spec (BD) is augmented with additional constraints by the EC pre-processing algorithm into BD_a, which is optimized by an off-the-shelf synthesis tool into OptD_a; during post-processing for EC, the design is partitioned into the part to leave intact (rBD_a) and the part to update (OBD_a), the constraints of rBD_a are manipulated into rcBD_a, the merger of rcBD_a and OBD_a is re-optimized by the off-the-shelf synthesis tool, and a (binary) search for minimal hassle yields the corrected optimized spec (cOptD_a).]
Figure 8.3: The design flows for design-for-EC and post-processing for EC.

The error correction post-processing is performed on the augmented design
specification BDa with the goal of altering as few design components as possible while creating an optimized design with the given functionality cOptDa. The error correction process is conducted iteratively in a loop with three steps. In the first step, the correction process is restricted to a partition OBDa ⊆ BDa. OBDa contains the set of corrections and their closest neighborhood. The optimization process is applied only to this portion of the design, while the optimization solution for the remainder of the graph is left intact. In the second step, the constraints of the remainder of the design, rBDa = BDa − OBDa, are manipulated. Although the manipulated part of the design, rcBDa, presents a problem of smaller cardinality, its constraints have the same impact on OBDa. The constraint manipulation algorithm is heavily dependent upon the actual optimization problem. Details of several such algorithms are presented in Section 5. In the last step, the off-the-shelf optimization algorithm is applied to the merger of parts MBDa = rcBDa ∪ OBDa. The portion of the solution to this problem which corresponds to OBDa is then substituted into the initial optimized solution OptDa, resulting in a corrected optimized solution cOptDa. The increased flexibility for EC embedded in the initial design specification BDa enables a more efficient search for the update that satisfies the correction. The described loop is repeated in a search for the smallest subdomain OBDa of the original specification in which the error correction can be performed.
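The loop above can be summarized schematically. The sketch below is only an illustration of the search structure, not the actual implementation: it assumes a networkx graph over the design components and takes the problem-specific "manipulate constraints plus off-the-shelf tool" step as a caller-supplied callable that returns a solution or None.

```python
# Schematic sketch of the iterative EC loop: binary search for the smallest
# locality OBD_a around the corrections that still admits a solution.
import networkx as nx

def smallest_locality(graph, corrections, resolve, max_radius):
    """resolve(graph, locality) stands in for constraint manipulation of the
    remainder plus a run of the off-the-shelf synthesis tool; it returns a
    corrected sub-solution or None if the locality is too small."""
    best, lo, hi = None, 0, max_radius
    while lo <= hi:
        radius = (lo + hi) // 2
        locality = set()
        for c in corrections:                 # OBD_a: corrections + neighborhood
            locality |= set(nx.single_source_shortest_path_length(
                graph, c, cutoff=radius))
        solution = resolve(graph, locality)
        if solution is not None:
            best, hi = (locality, solution), radius - 1   # try a smaller locality
        else:
            lo = radius + 1
    return best
```

For register allocation, resolve could, for instance, wrap the node-merger routine sketched earlier and report failure whenever more registers than allocated would be required.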
8.4 The EC Algorithms for Behavioral Synthesis
We applied the proposed EC methodology to two behavioral synthesis tasks: operation scheduling and register allocation. For each of these tasks, we have defined the corresponding design-for-EC and post-processing algorithms for EC, outlined effective algorithms for constraint manipulation, and demonstrated the approach using a second-order Gray-Markel ladder filter as an explanatory example (see Figure 8.4).

[Figure 8.4 panels: an example CDFG, the scheduled CDFG, and the interval graph for the CDFG; A4 requires a dedicated register.]
Figure 8.4: A second order Gray-Markel ladder filter: CDFG, its scheduling and the corresponding interval graph.
8.4.1 Register Allocation and Binding
Values generated in one control step and used in a later step must be stored in a register during the intermediate control-step transitions. The lifetime of a variable spans from the time it is generated to its last use. Two variables whose lifetimes do not overlap can be stored in the same register. Register assignment is modeled as coloring of an interval graph of a CDFG, where an interval graph [DeM94] is constructed by creating a node for each variable in the CDFG and an edge whenever the lifetimes of the adjacent nodes (variables) overlap. The GRAPH K-COLORABILITY problem is solvable in polynomial time for K = 2, but remains NP-complete for all fixed K ≥ 3 [Gar79]. The left-edge algorithm [Kur87] is optimal only for interval graphs constructed from CDFGs with no loops.
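As a small illustration of this model (with made-up lifetimes, not those of the Gray-Markel filter), the sketch below builds the interval graph from variable lifetimes and colors it with a greedy routine; the number of colors is the number of registers. It assumes the networkx package.

```python
# Register binding as interval-graph coloring (illustrative lifetimes only).
import networkx as nx

def interval_graph(lifetimes):
    """lifetimes: dict variable -> (first control step, last control step)."""
    g = nx.Graph()
    g.add_nodes_from(lifetimes)
    names = list(lifetimes)
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            (s1, e1), (s2, e2) = lifetimes[u], lifetimes[v]
            if s1 <= e2 and s2 <= e1:          # lifetimes overlap
                g.add_edge(u, v)
    return g

lifetimes = {"a": (1, 3), "b": (2, 5), "c": (1, 2), "d": (4, 6), "e": (6, 7)}
g = interval_graph(lifetimes)
binding = nx.coloring.greedy_color(g, strategy="smallest_last")
print(len(set(binding.values())), "registers needed:", binding)
```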
8.4.1.1 Design-for-EC
During design-for-EC, the initial interval graph is augmented with edges that enable flexibility for recoloring. The edges are added in correspondence with two types of errors that may occur. The first type of error is one where a variable V (with lifetime [CVS, CVE]) is modified to be used as an operand in an operation Oi that is executed outside the lifetime of variable V (COi > CVE). Such an error is modeled by adding edges (type-I) to the interval graph. The procedure that adds edges of type-I for flexibility in EC is presented in Figure 8.5. The goal of this procedure is heuristically defined and targets expansion of variables with short lifetimes. Given M, the maximal number of alive variables at any control step Ci (M = max(AliveVars(Ci)), i = 1, . . . , |C|), the procedure expands the lifetime of a variable if this expansion does not increase the number of alive variables at any control step beyond M − 1. In addition, at each control step Ci, only M − 1 − AliveVars(Ci) variables with the shortest lifetimes can be expanded by a single step. Figure 8.6 shows how the lifetimes of variables A1, C1, A5, A3, C2, and C3 (bold edges in the CDFG and interval graph) are expanded. Such a register assignment can be used to resolve a number of corrections with minimal update, as shown in Figure 8.10. The second type of error is one where entire operations (variables) are added to the spec. If such an operation is added at a part of the interval graph where a maximal clique occurs, the EC process requires the addition of a new register. Otherwise, it may happen that this variable can be stored
in the existing register file by recoloring the graph. While the first consequence can be handled trivially, the second one requires more attention. To enable effective recoloring, we identify and/or enable tuples of variables which can switch their registers arbitrarily.

M = max(AliveVars(Ci), i = 1, . . . , |C|)
Repeat
    For each control step Ci
        Subset of variables W = {Vi, CVEi = Ci−1}.
        Select a subset Wk ⊆ W of K < M − 1 − AliveVars(Ci) variables with the shortest lifetimes.
        For each Vi ∈ Wk
            CVEi = Ci.
until Wk is empty

Figure 8.5: Procedure used to embed edges of type-I into an interval graph.
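A minimal sketch of the lifetime-expansion heuristic of Figure 8.5 follows; it is an interpretation of the pseudo-code above (the dictionary-based lifetime encoding and the one-step-at-a-time expansion policy are assumptions), not the implementation used for the experiments.

```python
# Sketch of type-I edge embedding: extend short lifetimes by one control step
# wherever the number of alive variables stays at most M - 1.
def expand_lifetimes(lifetimes, num_steps):
    """lifetimes: dict variable -> (start, end) control steps (modified in place)."""
    def alive(step):
        return sum(1 for s, e in lifetimes.values() if s <= step <= e)
    m = max(alive(c) for c in range(1, num_steps + 1))
    changed = True
    while changed:
        changed = False
        for c in range(2, num_steps + 1):
            # Variables whose lifetime ends exactly at the previous control step.
            ending = [v for v, (s, e) in lifetimes.items() if e == c - 1]
            ending.sort(key=lambda v: lifetimes[v][1] - lifetimes[v][0])
            budget = m - 1 - alive(c)          # keep one register's worth of slack
            for v in ending[:max(budget, 0)]:
                lifetimes[v] = (lifetimes[v][0], c)   # expand by one step
                changed = True
    return lifetimes
```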
Definition 1. A clique of nodes V = {V1, . . . , Vk} in a graph such that every node Vj which is a neighbor of some node in V is also a neighbor of all other nodes in V is called a k!-way colorable k-clique.

For example, consider two adjacent nodes A and B. If the sets of nodes adjacent to A and B are identical, then nodes A and B can be colored with colors c1 and c2 or c2 and c1, respectively, in any valid coloring of the graph. For example, in Figure 8.6, nodes C4 and A7 have the same set of neighbors {C3, A6} and, therefore, can be colored in {c1, c3} or {c3, c1}.
[Figure 8.6: the scheduled CDFG and the interval graph for the CDFG augmented with edges for EC (node colors annotated); A4 requires a dedicated register.]
Figure 8.6: An example of addition of type-I constraints to the graph coloring problem.

At the time of correction, this property can be used for minimal-hassle graph recoloring. The pre-processing, which identifies and enables k!-way colorable k-cliques of nodes, consists of two steps. In the first step, as many cliques of large cardinality as possible are heuristically selected and augmented with type-II edges such that they become k!-way colorable. In the second step, a heuristic search identifies as many cliques as possible which are already k!-way colorable or require the addition of only a small number of edges to become such. The goal of both steps is to maximize the number of nodes which are part of k!-way colorable k-cliques.

The procedure that augments type-II edges into a graph coloring instance is outlined using the pseudo-code in Figure 8.7. In its first phase, the procedure sorts the set of control steps in descending order of the number of alive variables. Then, for each control step Ci, it identifies the variables which constitute a clique of cardinality AliveVars(Ci). The neighborhood of the clique is analyzed to determine whether it has good potential for embedding type-II edges. "Good potential" is heuristically defined by a bound on the number of edges adjacent to the nodes in the clique that have to be added in order to enable the clique's k!-way colorability.
Sort CS = Sort(C) according to ascending AliveVars(Ci)
M = max(AliveVars(Ci), i = 1, . . . , |C|)
For each control step CSi and its clique CL
    Find the set of edges E+ that must be added to the nodes Vi ∈ CL such that CL can be arbitrarily colored.
    For each edge Ei ∈ E+
        If for any Ci the addition of Ei results in AliveVars(Ci) < M
            break
    If no break
        Add each edge Ei ∈ E+ to the interval graph
        Remove the nodes in CL and their adjacent edges from IntervalGraph
For each edge Ei ∈ IntervalGraph between nodes A and B
    If |Nei(A) \ Nei(B) ∪ Nei(B) \ Nei(A)| > α
        Add the edges A − (Nei(B) \ Nei(A)) ∪ B − (Nei(A) \ Nei(B)) if none of them results, for any Ci, in AliveVars(Ci) < M

Figure 8.7: Procedure used to embed edges of type-II into an interval graph.

In addition, for each control step, the added edges should not increase AliveVars(Ci) above M. The bound M can be increased if the designers decide to include extra EC registers. The second phase, which enables a large number of 2-way colorable 2-cliques, starts by assigning weights to the edges in the interval graph. The weight of an edge between nodes A and B is equal to the number of nodes which are adjacent to one but not both of nodes A and B. Next, all edges with weights greater than a predetermined threshold value α are removed from the graph. For each edge EA,B with W(EA,B) < α, we add a set of edges E+ to nodes A and B such that they can be arbitrarily colored. Of course, the addition of each edge E ∈ E+ is bounded by the increase of AliveVars(Ci) beyond M for any control step Ci. An example of the addition of such edges is shown in Figure 8.8. Pairs of nodes {IN, A1}, {A1, C1},
{A3, C2}, and {A7, C4} are enabled for arbitrary colorability by the addition of the edges drawn in bold in the corresponding interval graph.

[Figure 8.8: the scheduled CDFG and the interval graph for the CDFG, with the type-I edges already included, augmented with type-II edges for EC (node colors annotated).]
Figure 8.8: An example of addition of type-II constraints to the graph coloring problem.
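The second phase of the Figure 8.7 procedure hinges on identifying pairs of adjacent variables whose external neighborhoods (almost) coincide. A sketch of just that detection step is shown below; the edge-addition step that makes near-misses fully swappable is omitted, and the graph is assumed to be a networkx-style interval graph.

```python
# Detect 2-way colorable 2-cliques: adjacent nodes whose neighborhoods differ
# by at most alpha external nodes (alpha = 0 means freely swappable registers).
def swappable_pairs(g, alpha=0):
    """g: a networkx.Graph (the interval graph)."""
    pairs = []
    for u, v in g.edges:
        nu = set(g[u]) - {v}
        nv = set(g[v]) - {u}
        if len(nu ^ nv) <= alpha:        # symmetric difference = the edge weight
            pairs.append((u, v))
    return pairs
```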
8.4.1.2 Post-processing for EC
Once the error is detected, the goal of the post-processing for EC is to update the smallest part of the design needed to achieve the desired functionality (timing). We have developed an approach which iteratively identifies a locality around the zone that needs to be updated, manipulates the constraints of the interval graph outside the identified locality, and applies off-the-shelf coloring tools to the merger of the manipulated and the updated specifications. The post-processing procedure is explained in detail in Figure 8.9. The first step in the post-processing function performs a simple binary search
on the size of the subgraph rBDa which will be left intact by the EC process. The parameter over which we perform the binary search is the maximal distance from any correction in the interval subgraph OBDa = BDa − rBDa to any node in rBDa. The second step involves manipulation of the constraints in the subgraph rBDa. The result of this step, rcBDa, is obtained by merging the nodes colored with the same color in the optimized solution OptDa, as formally described in Figure 8.9. Each new merged node inherits all edges adjacent to its parent nodes. Next, an off-the-shelf graph coloring algorithm is applied to the merger of rcBDa and OBDa. The colors of the nodes in OBDa in the coloring solution to this merger are copied into the initial optimized solution OptDa, resulting in a corrected coloring cOptDa.

An example of a merger of nodes is shown in Figure 8.10. The correction of the CDFG is shown in the left part of Figure 8.10. The new operation X is also added to the interval graph. The intent is to modify the colors of nodes only in the shaded area. Therefore, all other nodes are merged into four nodes: {A2}, {C4, C1}, {A3, A1, A8}, and {IN, A5, C5, A7}. Obviously, the coloring of this new instance satisfies the new constraints added by X while preserving the colors of the nodes outside the shaded area.

Repeat
    OBDa = BinarySearch(IntervalGraph, distance[i]).
    rBDa = IntervalGraph − OBDa.
    rcBDa = Manipulate(rBDa).
    Constraint Manipulation:
        For each color Ci
            Create a new node VI.
            For each node Vj ∈ rBDa colored in Ci
                For each edge Ej,k adjacent to Vj
                    If VI is not adjacent to Vk
                        Add an edge EI,k between VI and Vk.
                Remove Vj from rBDa.
        rcBDa = rBDa.
    cOptDa = GraphColoring(rcBDa ∪ OBDa)
until solution found
Update the colors in OptDa of all nodes in OBDa with their colors in cOptDa.

Figure 8.9: Procedure used to perform the error correction process with minimal hassle.
8.4.2 Operation Scheduling
In scheduling, the set of arithmetic and logical operations in the CDFG is partitioned into groups of operations so that the operations in the same group can be executed concurrently in one control step. In this section, we present the main properties of the developed algorithms for design-for-EC and post-processing for EC of operation scheduling solutions. The goal of the design-for-EC procedure is defined heuristically and targets dispersing a portion of the idle control steps equally over the computation. By doing this, we expect to enable a larger number of operations to have flexibility for rescheduling. The design-for-EC procedure is described formally using the pseudo-code in Figure 8.11. Initially, it adds a K-input, single-output unit X to the design. Next, a chain Ochain of CP successive operations of type X is added to the CDFG in order to force a critical path of length CP. At control steps at which a particular unit U is desired to be idle, operations of type U are attached to the augmented chain Ochain, as shown in Figure 8.12. Using this approach, the added operations can either guarantee that a particular unit is idle at some control step or that
in some range of control steps a particular unit has at least one idle control step. Obviously, parameter K is equal to the maximum number of idle units at any single control step. The frequency of adding the constraints is calculated as a user-specified percentage α of the ratio of the total number of idle control steps to the number of functional units. An example of such constraint augmentation is presented in Figure 8.12. The chain of operations of type X, as well as the augmented additions and multiplications, is presented in the shaded area. While the added multiplication enforces that a multiplication unit is idle in control step 6, the two added additions enforce that at least one addition unit is idle during control steps {7, 8} and {4, 5}. The appropriate scheduling that satisfies these constraints is presented in the same figure.

[Figure 8.10: the scheduled corrected CDFG, the interval graph for the corrected CDFG, and the coloring solution for the corrected CDFG; EC is performed only in the marked locality.]
Figure 8.10: Post-processing for EC: graph bipartitioning, constraint manipulation, and coloring.
Compute critical path CP.
Add a costless functional unit X to the allocation list.
Add a chain of CP operations of type X to the design spec.
Count the number of idle steps CSU in the computation for each set of functional units of type U.
For each set of identical functional units U
    For i = 1, . . . , α · CSU
        Add an operation of type U which uses as operands variables created by operations of type X executed in control step ASAP = ((i − 1)/(α · CSU)) · CP and generates a variable which is used in control step ALAP = (i/(α · CSU)) · CP.

Figure 8.11: Procedure used to perform the design-for-EC process for operation scheduling solutions.

The error correction process for selected areas in the CDFG is performed using the same sequence of steps as described in Section 4. The only algorithm that is specific to operation scheduling is the constraint manipulation procedure. This procedure has two inputs: a selection of operations O within the CDFG which can be altered, and a frame {Cstart, Cend} of control steps in the iteration. The routine creates a new CDFG from all operations O+ within {Cstart, Cend}. The As-Soon-As-Possible and As-Late-As-Possible scheduling boundaries for each operation in O+ are updated to lie within the selected frame. Then, the procedure manipulates the constraints of the subgraph O+ − O in the following way. It adds a costless, K-input, single-output computation unit of type X to the hardware allocation. A chain Ochain of Cend − Cstart successive operations of type X is added to the new CDFG in order to enforce a critical path of length CP.
Scheduled operations of the subgraph O+ − O are connected to the chain Ochain in such a way that any algorithm can trivially retrieve a scheduling solution equivalent to the one in the initial scheduling. An explanatory example of such a manipulation is illustrated in Figure 8.12. The correction introduces the new addition NEW to the CDFG. All corrections are done within the frame of control steps {4, . . . , 8}. Operation C4 is within the selected frame but is not assumed to be rescheduled. Therefore, its variables are fed from the chain of additional operations of type X, which enforces it to be scheduled as in the initial scheduling.

[Figure 8.12: the CDFG augmented with a chain of operations of type X; the added operations enforce that at least one addition unit is idle at control steps 4 and 5, one multiplication unit at control step 6, and at least one addition unit at control steps 7 and 8; EC is performed only in the marked locality, and the constraints of {O+ − O} are manipulated through the chain.]
Figure 8.12: Constraint augmentation and manipulation for pre- and post-processing for EC of operation scheduling.
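The following sketch mirrors the Figure 8.11 augmentation in code. The CDFG encoding (a dictionary of operations with type, ASAP/ALAP bounds, and predecessor lists) and the rounding of the ASAP/ALAP expressions to integer control steps are assumptions of this illustration, not part of the original tool.

```python
# Sketch of design-for-EC constraint augmentation for operation scheduling:
# a chain of dummy X operations plus unit-idling operations hooked onto it.
import math

def add_ec_constraints(cdfg, idle_steps, critical_path, alpha):
    """idle_steps: dict unit_type -> number of idle control steps CS_U."""
    chain = [f"X{i}" for i in range(1, critical_path + 1)]
    for i, x in enumerate(chain, start=1):
        cdfg[x] = {"type": "X", "asap": i, "alap": i,
                   "preds": [chain[i - 2]] if i > 1 else []}
    for unit, cs_u in idle_steps.items():
        for i in range(1, int(alpha * cs_u) + 1):
            asap = max(1, math.ceil((i - 1) / (alpha * cs_u) * critical_path))
            alap = math.ceil(i / (alpha * cs_u) * critical_path)
            # Operand produced by the chain at step asap; result consumed at alap,
            # so a unit of this type must be free somewhere in [asap, alap].
            cdfg[f"EC_{unit}_{i}"] = {"type": unit, "asap": asap, "alap": alap,
                                      "preds": [chain[asap - 1]]}
    return cdfg
```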
8.5 Experimental Results
In order to evaluate the developed EC algorithms, we have conducted experiments on several real-life designs using HYPER as a behavioral compiler [Rab91]. The collected data is presented in Tables 8.1 and 8.2. In both tables, column 1 shows
the description of the design. Columns 2-5 show the design properties: available control steps, critical path, number of variables, and number of registers. To test the EC methodology for graph coloring, we have reduced the number of available control steps by approximately 10% for each design (specified in column 2). This modification results in a changed interval graph. We have employed three EC techniques: one that performs both the design-for-EC and the EC post-processing step, one that only post-processes the design for EC, and, finally, complete recoloring, used to retrieve an approximate lower bound on the quality of the previous two techniques. The number of registers required to store all variables of the modified designs for each of the three EC techniques is presented in columns 6-8 of Table 8.1.

Design             Available       Critical  Vari-  Regi-  D-for-EC  Only  Complete
Description        control steps   path      ables  sters  and EC    EC    resynthesis
8th CF IIR         36 → 33         18        35     19     22        23    21
Linear GE Ctlr     24 → 22         12        48     23     25        28    25
Wavelet Filter     32 → 29         16        31     20     22        24    21
Modem Filter       20 → 18         10        33     15     18        19    18
Volterra 2nd ord.  12 → 10         24        28     15     17        19    17
D/A Converter      132 → 120       264       354    171    182       187   179
Long Echo Cnclr    5132 → 4500     2566      1082   1061   1068      1072  1065

Table 8.1: Engineering change experimental results: overhead of performing modifications on register allocation instances using design-for-EC and EC, only EC, and complete resynthesis.

To test the EC algorithms for operation scheduling, we have iteratively generated (1000 iterations for each design) one or two errors in the behavioral description and then applied the aforementioned three EC methods to reschedule the design. Only one type of error has been induced: changing the type of operations (e.g., addition to subtraction).
Modifications for each error were searched for within a frame of one tenth of the total available control steps. The frames were symmetrically positioned with respect to the error. Pairs of columns 6-7, 8-9, and 10-11 in Table 8.2 show the percentage of successfully performed modifications for a single (1E) and a double (2E) error for the three methods, respectively. In both applications the developed EC techniques have performed competitively with respect to complete resynthesis.

Design             Available       Critical  Vari-  Regi-  D-for-EC and EC   Only EC        Complete resynthesis
Description        control steps   path      ables  sters  1E      2E        1E      2E     1E      2E
8th CF IIR         36 → 33         18        35     19     92%     87%       90%     84%    94%     91%
Linear GE Ctlr     24 → 22         12        48     23     90%     86%       88%     81%    93%     91%
Wavelet Filter     32 → 29         16        31     20     95%     92%       91%     87%    96%     95%
Modem Filter       20 → 18         10        33     15     96%     92%       90%     84%    97%     96%
Volterra 2nd ord.  12 → 10         24        28     15     94%     92%       86%     82%    95%     93%
D/A Converter      132 → 120       264       354    171    98%     94%       94%     92%    98%     96%
Long Echo Cnclr    5132 → 4500     2566      1082   1061   97%     93%       94%     91%    99%     95%

Table 8.2: Engineering change experimental results: overhead of performing modifications on operation scheduling instances using design-for-EC and EC, only EC, and complete resynthesis.
8.6 Conclusion
We have introduced a novel design methodology which facilitates design-for-EC and post-processing to enable EC with minimal perturbation. Initially, as a synthesis pre-processing step, the original specification is augmented with additional
design constraints which ensure flexibility for future correction. Upon alteration of the initial design, a novel post-processing technique achieves the desired functionality with near-minimal perturbation of the initially optimized design. As a key contribution, we have highlighted a constraint manipulation technique which enables reduction of an arbitrary EC problem into its corresponding classical synthesis problem. As a result, traditional synthesis algorithms can be used to enable flexibility and perform local alterations.
CHAPTER 9
Intellectual Property Protection by Watermarking Combinational Logic Synthesis Solutions

The intellectual property (IP) business model is vulnerable to a number of potentially devastating obstructions, such as misappropriation and intellectual property fraud. We propose a new method for IP protection (IPP) which facilitates design watermarking at the combinational logic synthesis level. We have developed protocols for embedding designer- and/or tool-specific information into a logic network while performing multi-level logic minimization and technology mapping. We demonstrate that erasing the author's signature, or finding another signature in the synthesized design, can be made arbitrarily computationally difficult. We have also developed a statistical method which enables us to establish the strength of the proof of authorship. The watermarking method has been tested on a standard set of real-life benchmarks, where an exceptionally high probability of authorship has been achieved with negligible overhead in solution quality.
9.1 Introduction
The complexity of modern system synthesis, as well as the shortened time-to-market requirement, has resulted in design reuse becoming the predominant system development paradigm. The new core development strategies have affected the business model of virtually all CAD and semiconductor companies. To overcome the difficulties in core-based system design, the VSI Alliance has identified six technologies crucial for enabling effective design reuse: system verification, mixed signal design integration, standardized on-chip bus, manufacturing related test, system-level design, and intellectual property protection (IPP) [VSI].

We have developed the first approach for IPP which facilitates design watermarking at the combinational logic synthesis level. The watermark, designer- and/or tool-specific information, is embedded into the logic network of a design in a preprocessing step. The watermark is encoded as a set of design constraints which do not exist in the original specification. The constraints are uniquely dependent upon the author's signature. Upon imposing these constraints on the original logic network, a new input is generated which has the same functionality and contains the user-specific information. The added constraints result in a trade-off: the more additional constraints, the stronger the proof of authorship, but also the higher the overhead in terms of the quality of the synthesis solution. The application of the synthesis algorithm results in a solution which satisfies both the original and the constrained input. Proof of authorship is based upon the fact that the likelihood that another application returns a solution to both the original and the constrained input is exceptionally small. The developed watermarking technique is transparent to the synthesis step and can be used with any logic synthesis tool. We demonstrate that the developed IPP approach can be used to:

• Prove authorship of the design at levels of abstraction equal to or lower than logic synthesis. The existence of a user-specific signature in the solution of a multi-level optimization or technology mapping problem clearly identifies the author of the input design specification (the initial input logic network).
• Protect the synthesis tool. The signature of the tool developer, embedded in logic synthesis solutions, clearly indicates the origin of the synthesis tool.
9.2 Watermarking Desiderata
The recently proposed Strawman initiative [VSI] of the Development Working Group on IPP calls for the following desiderata for techniques which act as deterrents in order to properly ensure the rights of the original designers.

• Functionality Preservation. Design-specific functional and timing requirements should not be altered by the application of IPP tools.
• Minimal Hassle. The technique should be fully transparent to the already complex design and verification process.
• Minimal Cost. Both the cost of applying the protection technique and its hardware overhead should be as low as possible.
• Enforceability. The technique should provide strong and undeniable proof of authorship.
• Flexibility. The technique should enable a spectrum of protection levels which correspond to variable cost overheads.
• Persistence. The removal of the watermark should result in a task of difficulty equal to the complete redesign of the specified functionality.

In addition to the stated VSI intellectual property protection requirements, our approach also provides proportional protection of all parts of the design.
9.3 Watermarking Logic Synthesis Solutions
The synthesis flow which employs watermarking of combinational logic synthesis solutions encompasses several phases, illustrated in Figure 9.1. The first three phases in the watermarking approach are the same for both multi-level logic minimization and technology mapping.

[Figure 9.1 flow: the original design specification (netlist) undergoes assignment of a unique ID to each gate following an EDA standard for IP protection, a keyed one-way pseudo-random node permutation driven by the author's ID and secret key, and the addition of signature-specific constraints (the first K nodes are enforced to appear in the final solution as pseudo-primary outputs, which constitutes the watermark); the additionally constrained specification is then processed by the synthesis automation tool (technology mapping), yielding the watermarked optimized design.]
Figure 9.1: The protocol for hiding information in solutions for multi-level logic optimization and technology mapping.

In the first step, to ensure that the watermark cannot be misinterpreted, the gates in the initial logic network specification are sorted using an industry
standard. As a result of this procedure, each gate of a given logic network can be assigned an identifier which is unique with respect to the identifiers assigned to the gates in the remainder of the network. Next, K gates are selected in a way specific to the designer's or tool developer's signature. We use a keyed RC4 one-way function to generate pseudo-random bits [Men97] which guide the process of iterative gate selection. The outputs of the selected gates are explicitly assigned to become primary outputs. We have applied this protocol to the technology mapping synthesis step. Although the same protocol can be applied to watermark multi-level logic minimization solutions, for this task we provide an alternative protocol. Initially, it also generates pseudo-primary outputs according to the user's signature and, in addition, uses them as inputs to an additional logic network which is embedded into the initial design specification. The protocol builds the embedded network according to the designer's or tool developer's signature. After additionally constraining the initial design specification, the optimization algorithms are applied to the constrained logic network. The result retrieved by the synthesis algorithm satisfies both the initial and the constrained design specification. The proof of authorship rests upon the small likelihood that some other algorithm, when applied to the initial input, retrieves a solution which also satisfies the constrained input.
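As an illustration of the signature-driven selection (a sketch only: a keyed SHA-256 counter stream stands in for the keyed RC4 generator, and the gate names are hypothetical), the following picks K gates from the canonically ordered netlist to become pseudo-primary outputs.

```python
# Sketch of keyed pseudo-random gate selection for watermarking.
import hashlib

def keyed_bitstream(key):
    """Deterministic bit stream derived from the designer/tool key."""
    counter = 0
    while True:
        block = hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        for byte in block:
            for bit in range(8):
                yield (byte >> bit) & 1
        counter += 1

def select_watermark_gates(ordered_gates, key, k):
    """Pick k gates from the canonically ordered list as pseudo-primary outputs."""
    bits = keyed_bitstream(key)
    remaining = list(ordered_gates)
    chosen = []
    width = max(1, (len(remaining) - 1).bit_length())
    while len(chosen) < k and remaining:
        index = int("".join(str(next(bits)) for _ in range(width)), 2)
        if index < len(remaining):               # rejection sampling
            chosen.append(remaining.pop(index))
    return chosen

# Example: mark 3 gates of a toy ordered netlist as pseudo-primary outputs.
gates = [f"g{i}" for i in range(1, 21)]
print(select_watermark_gates(gates, b"designer-id|secret-key", 3))
```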
9.3.1 Gate Ordering
The watermarking process starts by assigning a unique identification number IDi to each gate Gi from the set G of gates which are not used as primary outputs. The unique identification number IDi is selected from the set IDi ∈ ID = {1...N } of N successive numbers, where N is the cardinality of the set G. We
have two main goals in this step: to map the network into a linear array so that cryptographic tools can be directly applied, and to develop a uniquely defined IPP procedure in such a way that the degrees of freedom available to potential attackers are maximally reduced.

Given a logic network LN = {G1, ..., GN, C} with a set I = {I1, ..., IK} of inputs and a set O = {O1, ..., OL} ⊆ G of output nodes.
Ordered set M of sets of nodes M = {M0 = G − O}
For each criteria function C[i], i = 1..8
    For each set of nodes Mi ∈ M with |Mi| > 1
        For each node Gj ∈ Mi compute Gj.objective = C[i](Gj)
        Partition Mi into an ordered set of unordered sets Mi,P1, ..., Mi,PK such that all Gj ∈ Mi,Pk have the same Pk = Gj.objective and Pk > Pk+1.
        Augment the new set of partitions into the initial set M in the following order ..., Mi−1, Mi,1, ..., Mi,K, Mi+1, ...
For each set of nodes Mi ∈ M with |Mi| > 1
    Randomly partition Mi into an ordered set of unordered sets Mi,P1, ..., Mi,PK, each of cardinality equal to 1.
    Augment the new set of partitions into the initial set M in the following order ..., Mi−1, Mi,1, ..., Mi,K, Mi+1, ...

Figure 9.2: Proposed function for completely defined node ordering.

To avoid misinterpretation of this ordering, we propose that an industry standard be established. The network has to be numbered in such a way that any two nodes that have different functionality and different transitive fan-in and fan-out networks are assigned different IDs. However, finding whether two nodes are functionally and topologically identical is a hard problem. The special case of the problem of finding whether two networks are identical, when all gates per-
form equivalent functions, is equivalent to the graph isomorphism problem. This problem has been listed as open in terms of its complexity [Gar79]. Therefore, we propose a heuristic function that exploits the functional and timing properties of a node to sort the nodes in a logic network. This function is explained using the pseudo-code in Figure 9.2. It performs iterative sorting of the nodes not used as primary outputs, using a list of criteria with distinct priorities. The objective of the ordering function is to partition a logic network LN(G, C), where G is a set of nodes and C is a set of connections between nodes, into an ordered set M of node subsets Mi ⊆ G such that each subset contains exactly one node. We propose the following list of eight criteria for node identification:

C[1] The level LINi of node Gi with respect to the inputs. A node Gi has level K if the longest path in the logic network from any input to Gi is of cardinality K.
C[2] The level LOUTi of node Gi with respect to the outputs. A node Gi has level K if the longest path in the logic network from any output to Gi is of cardinality K.
C[3] Number of nodes in the transitive fan-in of Gi at level K < LINi.
C[4] Number of nodes in the transitive fan-out of Gi at level K < LOUTi.
C[5] Functionality, fan-in, and fan-out of nodes in the transitive fan-in of Gi at level K < LINi.
C[6] Functionality, fan-in, and fan-out of nodes in the transitive fan-out of Gi at level K < LOUTi.
C[7] Functionality, fan-in, and fan-out of the fan-in and fan-out of nodes in the transitive fan-in of Gi at level K < LINi.
C[8] Functionality, fan-in, and fan-out of the fan-in and fan-out of nodes in the transitive fan-out of Gi at level K < LOUTi.

[Figure 9.3: an example logic network whose nodes are annotated with the sorting parameters (fan-out counts, transitive fan-in node counts, and the functionality of transitive fan-in/fan-out nodes) and labeled with the resulting order 1-5; one node is a primary output.]
Figure 9.3: An example of ordering nodes according to the proposed set of sorting criteria.

An example of how nodes are identified using the proposed set of sorting rules is given in Figure 9.3. Note that it is unlikely that two nodes have all parameters identical. This is due to the dependencies and non-symmetry between nodes in logic networks. If two nodes cannot be distinguished using the proposed set of rules, we assign random unique identifiers to these nodes and memorize the assignment for future proof of authorship.
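A simplified rendering of this ordering idea is sketched below; it uses only a few of the eight criteria (levels from inputs and outputs, fan-in, fan-out) and assumes the netlist is given as a networkx DAG, so it is an approximation of the procedure rather than the proposed standard.

```python
# Sketch of deterministic node ordering by a priority list of structural criteria.
import networkx as nx

def order_nodes(dag, primary_outputs):
    lin = {}   # level with respect to the inputs (longest path from any input)
    for v in nx.topological_sort(dag):
        preds = list(dag.predecessors(v))
        lin[v] = 1 + max((lin[p] for p in preds), default=0)
    lout = {}  # level with respect to the outputs
    for v in reversed(list(nx.topological_sort(dag))):
        succs = list(dag.successors(v))
        lout[v] = 1 + max((lout[s] for s in succs), default=0)
    candidates = [v for v in dag if v not in primary_outputs]
    # Simplified criteria tuple standing in for C[1]..C[8]; higher objectives first.
    key = lambda v: (lin[v], lout[v], dag.in_degree(v), dag.out_degree(v))
    return sorted(candidates, key=key, reverse=True)
```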
9.3.2 Watermark Encoding and Embedding
In the next phase of watermarking, a subset S ⊆ M of cardinality |S| = K is selected from the sorted set M of non-primary nodes. The selection is pseudo-random and corresponds uniquely to the designer's or tool developer's signature. Next, each node in the selected subset S is explicitly added to the list of pseudo-primary outputs. By performing this step, the watermarking routine enforces the nodes from the set S to be:

• visible in the final technology mapping solution;
• computed during the multi-level logic minimization of the logic network. Note that many subfunctions that exist in the input logic network do not exist in the optimized output logic network.
ILN is an input logic network. PPN is an ordered set of pseudo-primary nodes.
PRS = RC4bitGenerator(key1, key2), where key1, key2 are the designer and/or tool developer signature.
AddedGates = null
Repeat (Standard.cardinality of added gates)
    Gate G = Select(Library, PRS, Pointer)
    AddedGates.Add(G)
    ILN.Add(G, G.fanin[Select(PPN, PRS, Pointer)])
    PPN.Add(G.fanout)
End Repeat
For each gate G ∈ AddedGates
    If G.fanout = null
        ILN.fanout.Add(G)

Figure 9.4: Proposed function for watermarking multi-level logic minimization solutions using network augmentation.

The node selection is performed in the following way. Since the node selection step of watermarking is not assumed to be the computation bottleneck, we use the RC4 cryptographically secure pseudo-random bit-generator [Men97] to generate
a sequence of bits which decides upon node selection. The keys used to drive the randomization process represent the user signature. The result of this phase is a pseudo-random, signature-specific selection of a combination of K network nodes.

In the case of technology mapping for LUT-based FPGAs, the described node selection phase is the last phase in the protocol. However, it is important to stress the implications of a specific phenomenon in this problem. Cong and Ding [Con96a] have identified a class of MFFC nodes which are more likely to appear in the final solution than the remaining nodes. We have statistically evaluated the impact of this phenomenon on the strength of the proof of authorship enabled by our approach. For each instance of the problem, we have explicitly enumerated the ratio of MFFC nodes in the initial input specification (rin) and in the final solution (rout). We compute the likelihood of solution coincidence using the following formula: p = (rout·F / (rin·T))^(rout·W) · ((1−rout)·F / ((1−rin)·T))^((1−rout)·W), where F is the number of non-primary gates in the final solution, T is the total number of non-primary gates in the initial logic network, and W is the number of gates pseudo-randomly selected to become pseudo-primary outputs during the watermarking phase.

The protocol described for technology mapping can be applied to watermark solutions to the multi-level logic minimization problem. However, we propose an alternative protocol which provides a stronger proof of authorship due to the embedded additional constraints. This protocol augments signature-specific constraints into the input logic network in two phases. In the first phase, which is equivalent to the already described protocol for watermarking technology mapping solutions, the protocol marks the outputs of selected gates as visible by explicitly denoting them as pseudo-primary outputs. In the second phase, an additional network is augmented into the input. The additional network has as input variables the pseudo-primary output variables generated in the previous phase. The network
is built according to the user's signature. The pseudo-code for building the additional network is presented in Figure 9.4. The sequence of pseudo-random bits from the previous phase is used to provide a source of undeniable determination. Using this sequence, first, a gate G from the available library of gates is selected. Then, according to the pseudo-random sequence of bits, G.fanin pseudo-primary outputs are selected and used as inputs to the selected gate G. The output G.fanout is added to the list of pseudo-primary outputs. This output is subject to selection in future iterations of the procedure. The procedure can be repeated indefinitely; a possible termination policy may be established using industry-adopted standards. The additionally constrained original input netlist is fed to the optimization algorithm (multi-level logic minimization or technology mapping). The final solution is a network of cells (or subfunctions) which contains a solution to the original problem as well as to the user-specific augmentation of the original problem. The proof of authorship relies on the difficulty of modifying the input in such a way that the pseudo-primary outputs that correspond to the attacker's signature, together with the modified network that corresponds to the attacker's key, have a subsolution that is also a subsolution to the initial problem watermarked with the designer's watermark.
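For completeness, the coincidence-likelihood expression given in the preceding subsection can be evaluated directly; the numbers below are placeholders, not measured data.

```python
# Direct evaluation of the solution-coincidence likelihood p defined above.
def coincidence_likelihood(rin, rout, F, T, W):
    return ((rout * F / (rin * T)) ** (rout * W)
            * ((1 - rout) * F / ((1 - rin) * T)) ** ((1 - rout) * W))

# Placeholder values: 100,000 input gates, 10,000 gates in the solution,
# a 20-gate watermark, and MFFC ratios rin = 0.4, rout = 0.5.
print(coincidence_likelihood(rin=0.4, rout=0.5, F=10_000, T=100_000, W=20))
```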
9.3.3 Persistence to Possible Attacks
The attacker may try to modify the output locally in such a way that the watermark disappears or the proof of authorship drops below a predetermined standard. Therefore, the watermarking scheme has to be such that, in order to delete the watermark and still preserve solution quality, the attacker has to perturb a great deal of the obtained solution. This requires the attacker to develop a new optimization algorithm. For example, consider a design that has a total of 100,000 gates. In the final solution S, 10,000 nodes are visible (LUT or cell outputs) and therefore the average probability that a node from the initial network is visible in the final solution is p = 1/10. If the watermarking strategy results in a pseudo-random selection of 1000 visible vertices, the average probability that a node visible in S is visible in a solution obtained by some other algorithm is inherently p (assuming the challenging algorithm retrieves a solution of the same quality). The expected probability P that some other algorithm selects exactly the same subset S of nodes in the final solution is P = p^1000, or one in 10^1000. Consider an attacker who aims to reduce the likelihood of authorship by making local changes to the design in order to remove the watermark. To reduce the proof of authorship to one in a million, the attacker has to alter 851 nodes of the watermark, i.e., 85.1% of the watermark. To remove the watermark in such a way that the remaining proof of authorship is P = 0.1, the attacker has to modify 888 vertices in the watermark, or 88.8% of the watermark embedded in the technology mapping solution.

There are two scenarios in which the attacker can try to find his or her signature in an already watermarked solution (see Figure 9.5). The first one is a top-down approach, where the attacker modifies the input hoping that the tool will produce an output that contains the attacker's signature (as well as the author's signature). Since the node permutation is pseudo-randomized, the likelihood that the attacker's signature appears in the output is the same as the probability of two different algorithms retrieving the same solution. Thus, this attack is less efficient than trying to delete the signature. In the bottom-up approach the attacker concludes from the output (or its modification) what input produces an output that contains her or his
signature. However, in order to produce such an input (and possibly output), the attacker has to know which pseudo-random selection of nodes (and augmented network) corresponds to a specific input sequence. The attacker can obtain such information only if the inverse of the one-way function is known. For RC4-type one-way hash functions such inverses are not known [Men97].

[Figure 9.5: the top-down attack changes the input according to some heuristic and applies the tool, expecting an output containing both the attacker's and the author's signature; the bottom-up attack changes the output according to some heuristic and tries to conclude what input corresponds to the changed output such that both signatures can be detected in it.]
Figure 9.5: A top-down and bottom-up approach to finding the attacker's signature in an already watermarked solution.
9.4 Experimental Results
We demonstrate the effectiveness and quality of the developed IP protection approach on the problem of technology mapping for the set of MCNC benchmark designs (Table 9.1) and large industrial strength designs (Table 9.2). For LUT-
based 5-input technology mapping we used the CutMap algorithm [Con96b]. Although the designs evaluated in the MCNC benchmark suite are much smaller than current industrial circuits (the recently announced Xilinx Virtex series of FPGAs, implemented in a 0.25 micron technology, is expected to encompass 1,000,000 gates), we have achieved a likelihood of watermarked-solution coincidence on average equal to p < 10^-13 with an average overhead of 4%. In two cases design watermarking resulted in negative overhead. Similarly, we obtained an average p < 10^-26 with an average hardware overhead of 7.6%.

[Table data: for each circuit, the number of primary-output gates, the number of non-primary gates, the number of LUTs in the non-watermarked solution, and, for several watermark sizes, the number of LUTs in the watermarked solution, the hardware overhead, and the coincidence likelihood. Only the order of magnitude is guaranteed for numbers smaller than 2.07E-268.]
Table 9.1: Watermarking technology mapping solutions for the MCNC suite. Columns present, respectively: name of the circuit, number of primary outputs, number of non-primary gates in the project description, and the solution quality (number of LUTs) when the algorithm CutMap [Con96a] is applied to the original design. Each three-column subtable contains a column describing the number of LUTs in the watermarked solution, the hardware overhead with respect to the non-watermarked solution, and the likelihood that some other algorithm retrieves a solution which also contains the watermark.

We have applied the IPP protocol for technology mapping to a large industrial design example with over 47,000 non-primary and 5,000 primary gates. For 0.5%
and 1% of non-primary gates selected for assignment to pseudo-primary outputs, our approach resulted in solution coincidence likelihoods of 10^-124 and 10^-244, with incurred hardware overheads of 0.8% and 1.87%, respectively. The run-time of the optimization program for the watermarked input was within ±5% of the run-time for the original input. The evaluation of the developed watermarking technique for multi-level logic minimization yielded results similar to those for technology mapping. We applied the MIS suite of optimization algorithms [Bra87] to the standard and watermarked sets of MCNC benchmark designs. After specifying 1% or 2% of the non-primary output nodes to become pseudo-primary outputs, the MIS suite retrieved, on average, solutions with 2% fewer or 6% more literals, respectively.
9.5 Conclusion
We have developed the first watermarking-based approach for IPP of tools and designs in the combinational logic synthesis domain. The watermark, a set of constraints which correspond to the designer's and/or tool developer's signature, is added to the original design specification in a synthesis preprocessing step. After the synthesis tool retrieves a solution to the optimization problem, the added constraints are satisfied in addition to the original set of design constraints. This property is used to prove authorship in court. We demonstrated that the embedded watermarks are hard to delete and hard to find in an arbitrary solution. We have effectively applied our approach to the problem of technology mapping for LUT-based FPGAs using a set of benchmark designs.
[Table data: for each circuit, the number of primary-output gates, the number of non-primary gates, the number of LUTs in the non-watermarked solution, and, for several watermark sizes, the number of LUTs in the watermarked solution, the hardware overhead, and the coincidence likelihood.]
Table 9.2: Experimental results: watermarking LUT-based technology mapping solutions for a set of one small and five industrial designs. The first four columns correspond to the columns in Table 9.1. Next, there are five subtables with structure identical to the subtables in Table 9.1.
CHAPTER 10 Local Watermarking: Methodology and Application to Behavioral Synthesis Recently, the semiconductor industry has adopted the Intellectual Property (IP) business model as a dominant system-on-chip development platform. Since copyright fraud has been recognized as the most devastating obstruction to this model, a number of techniques for IP protection have been introduced. Most of them rely on a selection of a global solution to an optimization problem according to a unique user-specific digital signature. Although such techniques may provide convincing proof of authorship with little hardware overhead, they fail to protect design partitions, do not provide an easy procedure for watermark detection, and are not capable of detecting the watermark when the design or its part is augmented in another larger design. Since these demands are of the highest interest for the IP business, we introduce local watermarking as an IP protection technique which enables these features while satisfying the demand for low-cost and transparency. To prove its efficiency and to provide protection resilient to modern reverse engineering techniques, we have applied the new watermarking technology to a set of behavioral synthesis tasks such as operation scheduling and template matching. We have demonstrated that the difficulty of erasing or finding another signature in the synthesized design can be made arbitrarily computationally diffi-
189
cult. The watermarking method has been tested on a set of real-life benchmarks where high likelihood of authorship has been achieved with negligible overhead in solution quality.
10.1
Introduction
The complexity of modern electronics synthesis as well as shortened time-tomarket has resulted in design reuse as a predominant system development paradigm. The new core development strategies have affected the business model of virtually all VLSI CAD and semiconductor companies. For example, recently a number of companies have consolidated their efforts towards developing off-the-shelf programmable or application-specific cores (e.g. ARM, LSI Logic, Design-andReuse). It has been estimated that more than half of all ASICs in year 2000 will contain at least one core [Tuc97]. To rapidly overcome the difficulties in core-based system design, the Virtual Socket Initiative Alliance has identified six technologies crucial for enabling effective design reuse: system verification, mixed signal design integration, standardized on-chip bus, manufacturing related test, system-level design, and intellectual property protection (IPP) [VSI97]. Recently a number of techniques have been proposed for IPP of designs and tools at various design levels: design partitioning [Wol98], physical layout [Kah98], combinational logic synthesis [Kir98b, Lac98], behavioral synthesis [Qu98, Hon98], and design-for-test [Kir98d]. All these techniques facilitate augmentation of the user’s digital signature, encoded as a set of additional design constraints, into the original design specification in a pre-processing step with respect to the application of the optimization algorithm. The additional design constraints are spread over the entire design specification, thus, providing proportional protection for the entire design. The solution retrieved by the
190
optimization algorithm satisfies both the original and user-specific constraints. This property is the key to enabling low likelihood that another algorithm (or designer) can build such a solution with only the original design specification as a starting point. Although efficient, these techniques lack support for: • Effective signature detection. Since the encoding of a digital signature is dependent upon the structure of the entire design specification, detecting an embedded signature requires unique identification of each component of the design [Kir98]. Moreover, possible design alteration by the misappropriator may negligibly, but significantly alter the design in a way that restoring the identifiers of design components requires detection of a number of subgraph isomorphisms [Kir98b]. Unfortunately, this problem is still listed as open in terms of its complexity [Gar79]. • Protection of design partitions. Mentioned techniques are quite effective in protecting overall designs, they do not provide protection for design partitions. Namely, in many designs (cores), their parts may have substantial and independent value (for example, a discrete cosine transform filter in an MPEG codec). • Copied partition detection. Commonly, misappropriated designs or their parts are augmented into larger designs. This leaves no room for the existing protection techniques to facilitate the existence of a part of the watermark in a design as a proof for authorship. In this paper, we introduce local watermarking, a generic IPP technique which provides the aforementioned protection requirements and can be applied to many combinatorial and continuous optimization problems. We have applied
191
this IPP methodology on a subset of behavioral synthesis tasks: template matching and operation scheduling. Watermarking designs at the these levels enables IP commerce of optimized behavioral specifications and RTL designs, which is exceptionally important for application-specific systems. It also protects behavioral synthesis tools and designs at levels of abstraction equal or lower than behavioral synthesis. This property is becoming increasingly important because of the progress of reverse engineering technologies (e.g. Take Apart Everything Under The Sun Co. [Tae]) which enable precise, fast, and confidential retrieval of the netlist of a silicon product. As in the previous IPP techniques, in local watermarking, a watermark is encoded as a set of design constraints which does not exist in the original specification. The constraints are uniquely dependent upon author’s signature. Rather than embedding a single error-corrected watermark over the entire design, as in the previous techniques, in local watermarking, a number of “small” watermarks are randomly augmented in the design. “Small”, in a sense that the constraints of each watermark are placed in a smaller part (locality) of the design. Each watermark exists and can be detected in its locality in the design independently upon the remainder of the design. Therefore, such watermarks enable protection for parts of the design because the copy detection algorithm does not need to see the entire design in order to decode the added constraints. Upon imposing the user-specific constraints to the original behavioral specification, a new input is generated which has the same functionality but contains user-specific information. The application of the synthesis algorithm on such an input results in a solution which satisfies both the original and constrained design. Proof of authorship is based upon the fact that the likelihood that another application returns a solution to both the original and constrained input is ex-
192
ceptionally small. The added constraints may result in a synthesis trade-off. The more constraints, the stronger the proof of authorship, but the higher overhead on the solution quality.
10.2
Preliminaries
10.2.1
Hardware and Computational Model
We selected as our computational model synchronous data flow (SDF) [Lee87]. The SDF is a special case of data flow in which the number of data samples produced or consumed by each node on each invocation is specified a priori. Nodes can be scheduled statically at compile time onto programmable processors. We restrict our attention to homogeneous SDF (HSDF), where each node consumes and produces exactly one sample on every execution. The HSDF model is well suited for specification of single task computations in numerous application domains such as DSP, communications, and multimedia. The syntax of a targeted computation is defined as a hierarchical control-data flow graph (CDFG) [Rab91]. The CDFG represents the computation as a flow graph, with nodes, data edges, and control edges. The semantics underlying the syntax of the CDFG format is that of the SDF flow model. All developed EC techniques can be applied successfully to other computation models such as the discrete event, communicating FSMs, synchronous/reactive, dataflow process network, and Petri net model [Ku92, Kif95, Edw97]. In addition, local watermarking is a generic approach and can be used for IPP of solutions or tools for many other combinatorial and continuous optimization problems.
193
10.2.2
Targeted Behavioral Synthesis Tasks
Behavioral synthesis transforms a given behavioral specification into an RTL description that can implement a given functionality. Behavioral synthesis encompasses a variety of tasks, such as operation scheduling, resource allocation and binding, design partitioning, template matching, and transformations. An overview of existing synthesis techniques can be found in [Gaj92, DeM94]. For the sake of brevity, we demonstrate the developed local watermarking technique only for two synthesis tasks: operation scheduling and template matching. Scheduling is the process of partitioning the set of operations in the CDFG into groups such that the operations in the same group can be executed concurrently in one control step, while taking into consideration possible trade-offs between total execution time and hardware cost. Scheduling determines the total number of control steps needed to execute all operations in the CDFG, the minimum number of functional modules for design, and the lifetimes of variables. For scheduling, there are two basic approaches: heuristics [Pau89] and integer linear programming (ILP) [Hwa91]. Template mapping is the process of mapping high level descriptions to hardware libraries or instruction sets which involves template matching and selection, and clock selection [Cor96]. The IMEC high level synthesis group was one of the first to attempt template matching by addressing the issues of application-specific functional units using ILP [Geu92, Not91]. Rao and Kurdahi proposed the use of template matching within a framework of regularity extraction for addressing partitioning [Rao92].
194
10.3
Global Flow: IPP for Behavioral Synthesis
The generic approach for protecting solutions to the above problems is shown in Figure 10.1. Watermarking of the original behavioral specification is performed in a synthesis pre-processing step. In that step, the spec is augmented with one or several sets of pseudo-randomized design constraints which encode the author’s signature. Each set is attached to a particular locality within the design spec. For example, in local watermarking of graph coloring solutions, the watermark is embedded in a design subgraph.
Initial design spec
Transformations
User signature
Behavioral synthesis
Watermarking
Template matching
Local watermarking
Watermarking
Additionally constrained design spec
Scheduling
Synthesis tool
Resource allocation Graph partitioning
Optimized spec
Figure 10.1: The global flow of the generic approach for local watermarking behavioral synthesis solutions. After the algorithm retrieves a solution to the given problem, the added constraints are removed from the optimized design spec, producing a design which satisfies both the user-specific and original constraints. The likelihood that an-
195
other algorithm applied to the original non-constrained design spec retrieves a solution which accidentally satisfies also the user-specific constraints (solution coincidence, Pc ) has to be small in order to have strong proof of authorship (1−Pc ). During copy detection, the goal is to find at least one set of added constraints in a particular design. Since each watermark has the property of being detectable within its own locality, the attacker is not safe even if the misappropriated solution is embedded or cut. Hence, a local watermark has to satisfy the following properties: • it has to be hardly recognizable in a given design, • it has to be difficult to find in a non-constrained design, • it has to be hard to remove by a finite set of solution transformations, • and it has to be easy to detect using exhaustive search and knowing the watermarks’ structure.
10.4
IPP Protocols for Behavioral Synthesis
In this section, we describe the technical details behind the data hiding, watermark detection, and attacking processes for the developed l ocal watermarking techniques for two behavioral synthesis tasks: operation scheduling and template matching.
10.4.1
Operation Scheduling
In scheduling, the set of arithmetic and logical operations in the CDFG are partitioned into groups of operations so that the operations in the same group can be
196
executed concurrently in one control step. The key steps in the local watermarking protocol for scheduling are: determination of a subtree sT ∈ CDF G which will contain the watermark, assignment of a unique identifier to each operation in sT , and critical-path-aware user-specific constraint encoding into sT using edges that determine temporal dependencies. Each of these procedures has to be performed according to a predetermined industry standard that both the author and attacker have to obey in court. In the remainder of this subsection, we propose a set of such procedures. They are formally presented using the pseudo-code in Figures 10.2, 10.3, and 10.4. Given a computation CDFG, the subtree sT that contains the watermark can be determined in a number of ways. A simple, but ineffective, way is to randomly select a node Groot ∈ CDF G and then select its fan-in tree of distance K to become sT . To strengthen the difficulty of tampering an existing watermark or finding an arbitrary watermark in a solution, we select sT of cardinality |sT | in the following way. First, we randomly select a node Groot ∈ CDF G and identify a subtree ST which represents a fan-in tree of max-distance |sT |. Next, we assign a unique identifier to each node in ST . The ordering routine uses a list of criteria to sort the nodes and determine their identifiers. The objective of the ordering function is to partition the set of nodes GST ∈ ST into an ordered set M of node partitions {Mi ∈ GST , i = 1 . . . |GST |} such that each partition contains exactly one node. Nodes that belong to partitions that contain more than one node are excluded from the ordering process. We propose the following list of four criteria for subtree partitioning and node identification (an example of similar ordering is given in [Kir98]): C[1] The level Li of node Gi with respect to Groot . A node Gi has a level K if the longest path in the CDFG from Groot to Gi is of cardinality K.
197
Given a CDF G = {G, C} with a set of nodes G = {G1 , . . . , GN } and a set of connections C = {C1 , . . . , CM } between nodes Randomly select a node Groot ∈ G and use it as a root to select a subtree ST with max-distance |sT | from Groot CDFG Node Ordering Assigns a unique identifier IDi to each node Gi ∈ ST Initiate the root of the breadth-first search to Groot Initiate RSAbitGen.seed = U DS Repeat |sT | times Gi = current node during breadth first search Allow pruning the breadth-first search tree for at least one input Gbf to Gi pseudo-randomly selected using RSAbitGen.rand() For each Gj ∈ inputs(Gi ) and Gj = Gbf in increasing order of input’s IDs. If RSAbitGen.rand() == 1 Add Gj to the list of nodes to traverse during breadth-first search Add Gi to sT Move to the next node in the breadth first search list End Repeat
Figure 10.2: Pseudo-code of the proposed protocol for local watermarking of operation scheduling solutions: Subtree sT identification. C[2] The number of nodes in the transitive fan-in of Gi at level K > Li . C[3] Functionality (including the constant if operation with a constant) and fanin of nodes in the transitive fan-in of Gi at level K > LINi . C[4] Functionality (including the constant if operation with a constant) and fanin of the fan-in of nodes in the transitive fan-in of Gi at level K < Li . Once all nodes are uniquely identified, subtree sT ∈ ST is selected using a
198
Given a CDF G = {G, C} with a set of nodes G = {G1 , . . . , GN } and a set of connections C = {C1 , . . . , CM } between nodes Determine CP the critical path in the CDFG For each node Gi ∈ sT If Gi has laxity of CP (1 − 4) and (∃Gj ∈ sT, Gj .asap + 1 < Gi .alap or Gi .asap + 1 < Gj .alap) Add Gi to subsT End For Use RSAbitGen.rand()[log2
|subsT | |W M subsT |
]
to point to a single selection W M subsT of all possible selections of |W M subsT | nodes from subsT For each Gi ∈ W M subsT Find a subset GI ∈ sT where each node GI ∈ GI, Gi = GI has GI .asap + 1 < Gi .alap or Gi .asap + 1 < GI .alap) Using RSAbitGen.rand() select one node GQ ∈ GI Draw a temporal edge between T Ei (Gi → GQ )
Figure 10.3: Pseudo-code of the proposed protocol for local watermarking of operation scheduling solutions: Constraint Encoding for Operation Scheduling. pseudo-random sequence of bits uniquely initiated by the author’s digital signature U DS. We use the RSA cryptographically secure one-way pseudo-random bit-generator [Men97] to generate this sequence. In order to determine sT , the watermarking procedure traverses the subtree ST in a top-down (in reverse direction of edges) breadth-first fashion. The author-unique sequence of bits at each N -input node determines (i) at least one input to include in the next level of breadth-first search; and (ii) whether each of the other inputs should be included or excluded from the list of succeeding nodes to be visited during the breadthfirst search. Note that the selection process cannot be misinterpreted because of the unique identification of each node input.
199
Given a CDF G = {G, C} with a set of nodes G = {G1 , . . . , GN } and a set of connections C = {C1 , . . . , CM } between nodes Ordered set M of sets of nodes M = {M0 = GST } For each Criteria Function C[i], i = 1..4 For each set of nodes Mi ∈ M with |Mi | > 1 For each node Gj ∈ Mi compute Gj .objective = C[i](Gj ) Partition Mi into an ordered set of unordered sets Mi,P1 , ..., Mi,PK such that all Gj ∈ Mi,Pk have the same Pk = Gj .objective and Pk > Pk+1 . Augment the new set of partitions into the initial set M in the following order ..., Mi−1 , Mi,1 , ..., Mi,K , Mi+1 , ... For each set of nodes Mi ∈ M with |Mi | > 1 Randomly partition Mi into an ordered set of unordered sets Mi,P1 , ..., Mi,PK each of cardinality equal to 1. Augment the new set of partitions into the initial set M in the following order ..., Mi−1 , Mi,1 , ..., Mi,K , Mi+1 , ...
Figure 10.4: Pseudo-code of the proposed protocol for local watermarking of operation scheduling solutions: CDFG Node Ordering. Then, the selected subtree sT is augmented with edges which indicate temporal dependencies between operations. Such edges are standard nomenclatures for behavioral descriptions (HYPER [Rab91]). They impose that the source operation is scheduled before the destination operation. The temporal edges are augmented according to the author’s digital signature (UDS) on the subset of subsT nodes of the subtree subsT ∈ sT . For each node Gi ∈ subsT exists at least one more node Gj ∈ subsT with overlapping periods {ASAPj + 1 > ALAPi or ASAPi + 1 < ALAPj }, where ALAP and ASAP are As Late As Possible and As Soon As Possible control steps respectively. Also, each node Gi ∈ subsT has laxity of CP ·(1−1), where CP is the length of the CDFG’s critical path and 1 is a
200
user/standard specified parameter. A node Gi has l axity of X if the longest path that contains Gi traverses the CDFG and has length of X. The restriction with respect to node’s overlapping ASAP-ALAP lifetimes and laxity is imposed to avoid significant timing overhead and more accurate proofs of authorship. If cardinality |subsT | is less than some predetermined K the entire process of subtree selection is repeated. Temporal edges are added to the ordered set of nodes subsT (using the sorted list of node identifiers) in the following way. We use a keyed RSA one-way function initialized with the user’s digital signature to generate a bitstream [Men97]. This bitstream is used to identify an ordered selection W M subsT ∈ subsT of |W M subsT | nodes from the list of all possible ordered selections of |W M subsT | nodes from subsT , where |W M subsT | is a parameter external to the watermarking procedure. Therefore, two different authors would have additional constraints imposed on two different sets of |W M subsT |-node selections. Finally, in the order of appearance, for each node Gi ∈ W M subsT , we identify a set GI ∈ sT of nodes where each node GI ∈ GI has overlapping life periods with Gi , {ASAPI + 1 > ALAPi or ASAPi + 1 < ALAPI }. Using the bit generator, the watermarking procedure selects one node GQ from GI and draws a temporal edge T Ei between T Ei (Gi → GQ ). The watermarking process is terminated when all |W M subsT | temporal edges are drawn. Note that for easier watermark detection, the subtree sT and its scheduling are memorized. At that point in the design process, the designer runs the scheduler to determine a locally watermarked solution to the original design spec. During the detection process, each node in the CDF G is visited and tested whether it can be a root of the memorized subtree sT and its scheduling. The key to the efficiency of such watermarking approach lies in the following
201
three facts. Firstly, for each CDFG the approximate likelihood of coincidence of finding an arbitrary watermark in a solution is equal to: Pc ≈ N odes ·
|W M subsT | i=1
T EW M (T Ei ) T EnonW M (T Ei )
where N odes is the number of nodes in the original design spec and T EW M (T Ei ) and T EnonW M (T Ei ) return respectively the number of possible schedulings of the T Ei source and destination nodes when T Ei is imposed and not. According to published results, we have assumed the Poisson distribution of the operation’s ASAP and ALAP times as well as that second order effects have negligible influence on the actual scheduling probabilities [Pau89]. Obviously, for large selected subtrees, the strength of the approximate proof of authorship {1 − Pc } can be made very strong. An example of determining the cardinalities T EW M (T Ei ) and T EnonW M (T Ei ) is presented in Figure 10.5. Two operations O[i] and O[j] can be scheduled in 77 different ways. However, there are only 10 possible schedulings how O[j] can be scheduled before O[i]. If T Ei has that direction, the corresponding cardinalities are T EW M (T Ei ) = 10 and T EnonW M (T Ei ) = 77. Given a scheduling solution quality (K control steps), the exact likelihood of coincidence is equal to the number of nodes in the original CDF G from which one can find the subtree sT times the number of solutions of quality K for the watermarked design spec divided by the number of solutions of quality K for the non-watermarked (non-constrained) design spec. Since the exhaustive enumeration of solutions frequently results in exponential run-times, we have used a trivial enumeration technique to calculate these probabilities only for smaller examples. The technique iterates through all viable combinations for scheduling operations between their ASAP and ALAP control steps. Secondly, the technique has solid resistance against tampering. The attacker
202
may try to modify the output locally in such a way that the watermark disappears or the proof of authorship is lowered below a predetermined standard. Thus, the watermarking scheme has to be such that, to delete the watermark and still preserve solution quality, the attacker has to perturb great deal of the obtained solution, forcing him/her to repeat the design process. For example, consider a design that has a total of 100,000 operations which satisfy the laxity requirement with 100 additional temporal edges imposed for watermarking purposes. Consider that the attacker aims to reduce the likelihood of authorship by doing local changes to the design. To reduce the proof of authorship to one in a million, under the assumption of average
T EW M (T Ei ) T EnonW M (T Ei )
= 12 , the attacker has to
alter the execution order of at least 31,729 pairs of nodes, i.e. alter 63% of the final solution. Thirdly, the one-way property of the random bitstream generator prohibits the attacker to locally modify the design in order to augment her/his signature. Namely, this finite set of modifications would require the knowledge of the inverse to the bitstream generator. Such easy-to-compute inverse function is not known for the RSA pseudo-random bit generator [Men97]. We demonstrate the developed protocol for local watermarking of scheduling solutions using a simple explanatory example: fourth order parallel IIR filter. The unscheduled CDF G for this filter structure is illustrated in Figure 10.7. An example subtree sT ∈ CDF G is presented at the bottom of Figure 10.5. Assuming that subsT = sT , the ordered set of temporal edge sources is C1, C2, C4, C7, A2 and the set of destination nodes is C3, C4, C8, C6, A3. For example, the temporal edge T E1 (C1 → C3) imposes that operation C1 should be executed before C3. The total number of scheduling solutions of the original sT subtree is 166, while only 15 solutions can be obtained when the additional constraints
203
D1
IN
D2
C1
*
D3
D4
ASAP[i]
ALAP[i]
Control Steps
C2
*
Operation[i] C3
* A2
A4
+ *
+
A1
C6 C5
* C4
+ * C7
Operation[j]
C8
*
ASAP[j]
ALAP[j]
A8
* +A3 +
Direction of temporal edge
Number of solutions
+ A6 + A5 A9
A7
+
+
O[i]
O[j]
2 x 11 + 10 + 9 + 8 + 7 + 6
O[i]
O[j]
4+3+2+1
O[i]
O[j]
5
Total D1
D2
OUT
A5 A1 A2
+
D4
D3
A9
77
+
+
+ +
+ A3
A6
+
A4
+
C2
+ C7
C3
A8
IN
C6 C1
A7
C8
C4
Figure 10.5: An example of local watermarking scheduling solutions: fourth order parallel IIR filter. are imposed. Thus, for this small example, the likelihood of solution coincidence is equal to Pc =
15 . 166
Obviously, Pc is in exponential correspondence with respect
to the CDF G cardinalities. An example of a final solution to the additionally constrained scheduling problem is shown in the upper left corner in Figure 10.5.
10.4.2
Template Matching
In template mapping at the behavioral level, groups of primitive operations are replaced with more complex and specialized hardware units which are designed
204
to implement common operations and are optimized for low area, power, or delay [Cor96]. The template mapping step involves template matching, template selection, and clock selection. In this paper, we address only the problem of local watermarking template matching solutions. A protocol for global watermarking of a variant of such a problem has been introduced by Kirovski et al [Kir98b]. Their constraint encoding technique assigns a signature-specific subset of circuit’s internal nodes to become pseudo-primary outputs (PPOs), thus, inducing the optimization algorithm to preserve these nodes as visible in the technology mapping solution. We introduce a novel approach for constraint encoding of mapping problems. The key idea guiding the new constraint encoding protocol is enforcement of node-module matchings by constraint manipulation in accordance with the user’s digital signature. Particular matchings are enforced to appear by assigning the nodes neighboring the matched module to become PPOs. In the remainder of this subsection, the watermarking protocol is explained in great technical detail and the approach is demonstrated using an explanatory example. The protocol for local watermarking of template matching solutions relies on the process of identifying a uniquely enumerated subtree sT of the original CDF G which will be augmented with user-specific constraints. The process of identifying the subtree sT is signature dependent, i.e. for a single signature only one subtree sT starting from a node in the CDF G can be determined. We have adopted the same sequence of steps for this procedure as implemented in the equivalent protocol for watermarking operation scheduling solutions. Therefore, we introduce only the constraint encoding protocol which assumes that the subtree sT has been identified and enumerated as described in the pseudo-code in Figure 10.2. The detailed description of this part of the developed protocol is formally presented using the pseudo-code in Figure 10.6.
205
Given a computation CDF G, a subset of nodes sT , and a library LIB with a set of modules Mi ∈ LIB, i = 1, . . . , |LIB|. Each operation Oi in each module Mk is uniquely enumerated. Each node Gi in the subtree sT has a unique identifier IDi . Constraint Encoding for Template Matching Repeat M AT CH times Compute CP critical path of the CDF G in terms of modules. subsT = remove all nodes from sT which have laxity in the range {CP · (1 − 4), CP } modules. List of all mappings LoM = null. For each node Gi ∈ sT in increasing order of IDi For each Oi ∈ Mk ∈ LIB that can be mapped into Gi Exhaustively try to map neighbors of Gi to all operations in Mk ∼ M atcho = {Gj → Oj , j = 1, . . . , |Mk |} If mapping M atcho ∈ LoM Add M atcho at the end of list LoM Select from LoM a mapping M atchk using RSAbitGen.rand() For each input and output Gi ∈ CDF G to M atchk Gi = P P O Remove Gi from sT End Repeat
Figure 10.6: Pseudo-code of the proposed protocol for constraint encoding during local watermarking of template matching solutions. The constraint encoding procedure imposes additional constraints on the selected subtree sT with the goal to isolate particular groups of operations that can be matched to the templates available from the library. An example of such matching is illustrated in Figure 10.7. In order to isolate the two-adder template matched with A5 and A6, variables P O1, P O2, and P O3 in the neighborhood are assigned to become PPOs. Since one of the inputs to A6 is a primary in-
206
put, it is not additionally constrained. Important fact is that any variable in the CDF G (not only the ones in the subtree sT ) can be part of the subset of variables that has to be promoted to PPOs. Addition of these constraints may affect the matchings along the critical path resulting in decreased solution quality. Therefore, from the selected subtree sT , we exclude all nodes that are on the critical path (of length CP modules) or paths of laxity within {CP · (1 − 1), CP } modules, where 1 is a parameter external to the watermarking protocol. This exclusion creates a new subset of nodes denoted as subsT ∈ sT .
+
IN
A1
A5
+
A9
+
OUT
D1 A2
C1
+
C5
+
A6
D2 C2
C6
A3
+
+
A7
D3 A4
C3
+
C7
+
A2
+
+
+ Templates
A5
A9
+
+
(PO2)
(PO5)
+
+ A3
A6
+
A4 C2
+
A7
(PO4) A8
C7 C3
+
(PO6)
+
C5 C1
+
C8
(PO3)
A1
T2
A8
D4 C4
(PO1)
T1
C8
C4
Figure 10.7: An example of local watermarking template matching solutions: fourth order parallel IIR filter.
207
The encoding procedure embeds the watermark iteratively in a loop which contains two steps. In the first step, given a subset of nodes subsT and a library with unique identifier for each operation in each module, all possible singletemplate matchings are exhaustively enumerated. The result of this enumeration is an ordered list LoM of matchings. For example, in the explanatory example, operation A9 can be matched in five different ways: as first addition in T1 , as second addition in T1 with no mapping for the first addition or as A5 or A7 as first additions, and as an addition in T2 . The goal of the enumeration procedure is to assign unique identifiers for each matching. This procedure is described using the pseudo-code in Figure 10.6. In the second step, the bitstream, produced by the RSA cryptographically secure one-way pseudo-random bit-generator [Men97] and seeded with author’s digital signature, is used to point to one of the matchings M atchi from LoM . Then, within the entire CDF G, all variables that are used as inputs/outputs to/from the operations covered by the module in the mapping M atchi are assigned to become PPOs. Next, all operations Gj ∈ M atchi are removed from sT . The subset of nodes subsT is updated by recomputing the laxity for each nodes in sT and then excluding the ones that have laxity within {CP · (1 − 1), CP } modules. Once the subset subsT is recomputed the constraint encoding loop is repeated. We treat the number of enforced matchings per subtree, M AT CH, as an external parameter to the protocol. The presented watermarking protocol is demonstrated using the fourth order parallel IIR filter. The CDFG of the filter and the available library of templates are presented in Figure 10.7. The watermarking process has isolated the following matchings {A5, A6}, {A9, A7}, and {A8, C7} (shaded matchings). The nonshaded matchings indicate a possible solution to this instance of the template
208
matching problem. The likelihood of solution coincidence for this protocol equals to the number of nodes in the original CDF G from which one can find the subtree sT times the number of solutions of quality K for the watermarked design spec divided by the number of solutions of quality K for the non-watermarked (non-constrained) design spec. Since this approach for computing Pc requires explicit enumeration of all possible solutions, which can be exponentially dependent upon the CDF G cardinalities, we opted to use an approximate technique for determining Pc . We define Pc∼ as: Pc∼ =
M AT CH i=1
1 Solutions(M atchi )
where Solutions(M atchi ) returns the number of different matchings for all nodes covered by the enforced template M atchi . For example, pair of nodes A5, A6 can be covered in the following six ways: A5
A6
A5
A6
A5
A6
A5, A9
A6
A5, A9
A6, C5
A1, A5
A6
A1, A5
A6, C5
A1
A6, C5
A5, A6
A5, A6
Finally, note that this protocol has resistance to tampering proportional to the protocol presented in [Kir98]. Similarly, both presented local watermarking protocols are resistant to the attack scenarios presented in [Kir98].
10.5
Experimental Results
We have conducted a set of experiments in order to evaluate the efficacy of local watermarking on the operation scheduling and template matching tasks.
209
Although the local watermarking technique is presented using the SDF computation model, we have tested its efficiency on a set of benchmark programs specified in C, i.e. standard RAM computation model [Aho83]. The dependencies were induced using additional operations with unit operators (e.g. additions with variables assigned to zero at run-time). Note that in the actual implementation the added instruction should be extracted from binaries for security and performance reasons. All programs were compiled for a four-issue VLIW machine with four ALUs, two branch and two memory units, and 8KB cache [Lee98]. The code was compiled for the described machine using the retargetable IMPACT C compiler [Cha91]. We have watermarked the applications collected from the MediaBench set of benchmarks [Lee97]. The obtained results for operation scheduling are presented in Table 10.1. The first two columns present the name of the application and its number of operations N . For each application, we have augmented local watermarks within a subtree of cardinality |sT | = 10 · α · N, α = 0.2, 0.5. In columns three and five, we present the likelihood of solution coincidence Pc∼ for |W M subsT | = 0.2 · |sT |. Columns four and six demonstrate the percentage of increase of execution time induced by the augmented code due to watermarking. As presented, all IPP properties enabled by local watermarking were provided with negligible performance overhead. The results of experiments conducted to test the template matching algorithm are presented in Table 10.2. The local watermarking techniques for template matching were tested on a set of small real-life designs [Rab91]. We used HYPER as a behavioral synthesis tool [Rab91]. Columns 1-4 present the design’s description, number of available control steps, critical path, and number of variables. Column 5 quantifies the percentage β of templates that were enforced M AT CH = 0.07 · |sT |, sT = CDF G. Finally, column 6 presents the percentage of increase of the count of used modules to cover the entire design with respect to
210
Application
Operation Scheduling
Descri-
Oper-
2% nodes constrd
ption
ations
Pc∼
Perf. OH
Pc∼
Perf. OH
D/A Cnv.
528
10−26
0.5%
10−53
1.5%
G721
758
−27
10
0.7%
−67
10
1.7%
epic
872
10−39
0.6%
10−91
2.4%
PEGWIT
658
10−27
0.2%
10−73
1.1%
PGP
1755
10−89
-0.1%
10−283
0.5%
GSM
802
10−34
0.3%
10−87
1.4%
−212
10
0.2%
10−185
0.4%
−65
JPEG.c
1422
10
MPEG2.d
1372
10−58
0% 0.2%
5% nodes constrd
Table 10.1: Experimental results describing the efficiency of applied local watermarking protocols to operation scheduling. two design strategies: non-watermarked and watermarked. Knowing the simplicity of target benchmark designs, the order of the likelihood of design coincidence has ranged from 10−5 to 10−27 for all specified designs. Therefore, note that in both the operation scheduling and template matching task, local watermarking has showed as an effective IPP methodology providing partial protection with low overhead and high confidence and reliability.
10.6
Conclusion
We have introduced l ocal watermarking, an IPP technique that, besides the standard set of VSI requirements, enables protection of design partitions, provides an easy procedure for watermark detection, and enables detection of watermarks when the misappropriated design or its part is augmented into another larger design. We have applied the new IPP technology to a subset of behavioral synthesis
211
tasks: operation scheduling and template matching. We have demonstrated that the difficulty of erasing author’s signature or finding another signature in the synthesized design can be made arbitrarily computationally difficult. The watermarking method has been experimented on a set of benchmarks, where high likelihood of authorship has been achieved with negligible overhead in solution quality. Design Description
Template Matching
Available
Critical
Vari-
%
Overhead
control
path
ables
mod.
module
enf.
count
steps 8th Order
18
18
35
3%
8.2%
CF IIR
36
18
35
3%
3.3%
Linear GE
12
12
48
5%
11.1%
Cntrlr
24
12
48
5%
5%
Wavelet
16
16
31
4%
10%
Filter
32
16
31
4%
3.3%
Modem
10
10
33
5%
8.7%
Filter
20
10
33
5%
2.5%
Volterra
12
12
28
5%
8.7%
2nd ord.
12
24
28
5%
6%
Volterra
20
20
50
3%
9%
3rd non-lin.
20
40
50
3%
5.2%
D/A
132
132
354
4%
3%
Converter
132
264
354
4%
0.4%
Long Echo
2566
2566
1082
2%
1%
Canceler
5132
2566
1082
2%
0.1%
Table 10.2: Experimental results describing the efficiency of applied local watermarking protocols to template matching.
212
CHAPTER 11 Forensic Engineering Techniques for VLSI CAD Tools The proliferation of the Internet has affected the business model of almost all semiconductor and VLSI CAD companies that rely on intellectual property (IP) as their main source of revenues. The fact that IP has become more accessible and easily transferable, has influenced the emergence of copyright infringement as one of the most common obstructions to e-commerce of IP. In this paper, we propose a generic forensic engineering technique that addresses a number of copyright infringement scenarios. Given a solution SP to a particular optimization problem instance P and a finite set of algorithms A applicable to P , the goal is to identify with a certain degree of confidence the algorithm Ai which has been applied to P in order to obtain SP . We have applied forensic analysis principles to two problem instances commonly encountered in VLSI CAD: graph coloring and boolean satisfiability. We have demonstrated that solutions produced by strategically different algorithms can be associated with their corresponding algorithms with high accuracy.
213
11.1
Introduction
The emergence of the Internet as the global communication paradigm, has enforced almost all semiconductor and VLSI CAD companies to market their intellectual property on-line. Currently, companies such as ARM Holdings [Arm99], LSI Logic [Lsi99], and MIPS [Mip99], mainly constrain their on-line presence to sales and technical support. However, in the near future, it is expected that both core and synthesis tools developers place their IP on-line in order to enable modern hardware and software licensing models. There is a wide consensus among the software giants (Microsoft, Oracle, Sun, etc.) that the rental of downloadable software will be their dominating business model in the new millennium [Mic99]. It is expected that similar licensing models become widely accepted among VLSI CAD companies. Most of the CAD companies planning on-line IP services believe that copyright infringement will be the main negative consequence of IP exposure. This expectation has its strong background in an already ”hot” arena of legal disputes in the industry. In the past couple of years, a number of copyright infringement lawsuits have been filed: Cadence vs. Avant! [EET99], Symantec vs. McAfee [IW99], Gambit vs. Silicon Valley Research [GCW99], and Verity vs. Lotus Development [IDG99]. In many cases, the concerns of the plaintiffs were related to the violation of patent rights frequently accompanied with misappropriation of implemented software or hardware libraries. Needless to say, court rulings and secret settlements have impacted the market capitalization of these companies enormously. In many cases, proving legal obstruction has been a major obstacle in reaching a fair and convincing verdict [Mot99, Afc99]. In order to address this important issue, we propose a set of techniques for the forensic analysis of design solutions. Although the variety of copyright in-
214
fringement scenarios is broad, we target a relatively generic case. The goal of our generic paradigm is to identify one from a pool of synthesis tools that has been used to generate a particular optimized design. More formally, given a solution SP to a particular optimization problem instance P and a finite set of algorithms A applicable to P , the goal is to identify with a certain degree of confidence that algorithm Ai has been applied to P in order to obtain solution SP . In such a scenario, forensic analysis is conducted based on the likelihood that a design solution, obtained by a particular algorithm, results in characteristic values for a predetermined set of solution properties. Solution analysis is performed in three steps: collection of statistical data, clustering of heuristic properties for each analyzed algorithm, and decision making with confidence quantification. In order to demonstrate the generic forensic analysis platform, we propose a set of techniques for forensic analysis of solution instances for a set of problems commonly encountered in VLSI CAD: graph coloring and boolean satisfiability. We have conducted a number of experiments on real-life and abstract benchmarks to show that using our methodology, solutions produced by strategically different algorithms can be associated with their corresponding algorithms with relatively high accuracy.
11.2
Existing Methods for Establishing Copyright Infringement
In this subsection, we present an overview of techniques used in court to distinguish substantial similarity between a copyright protected design or program and its replica. The dispositive issue in copyright law is the idea-expression dichotomy, which
215
specifies that any idea (system) of operation (concept), regardless of the form in which it is described, is unprotectable [McG95]. Copyright protection extends only to the expression of ideas, not the ideas themselves. Although courts have fairly effective procedures for distinguishing ideas from expressions [McG95], they lack persuasive methods for quantifying substantial similarity between expressions; a necessary requirement for establishing a case of copyright infringement. Since modern reverse engineering techniques have made both hardware [Tae99] and software [Beh98] vulnerable to partial resynthesis, frequently, plaintiffs have problems identifying the degree of infringement. Methods used by courts to detect infringement are currently still rudimentary. The three most common tests: the “ordinary observer test”, the extrinsic/intrinsic test, and the “total concept and feel test” are used in cases when it is easy to detect a complete copy of a design or a program’s source code [McG95]. The widely adopted “iterative approach” enables better abstraction of the problem by requiring: (i) substantial similarity and a proof of copying or access and (ii) proof that the infringing work is an exact duplication of substantial portions of the copyrighted work [McG95]. Obviously, neither of the tests addresses the common case in contemporary industrial espionage, where stolen IP is either hard to abstract from synthesized designs or difficult to correlate to the original because of a number of straightforward modifications which are hard to trace back. For instance, performing peephole optimizations can alter a solution to an existing optimization problem in such a way that the end product does not resemble the original design. This issue is highly important for VLSI CAD tool developers, due to the difficulty of rationalizing similarities between different or slightly modified synthesis algorithms. For example, a probabilistic partitioning engine would create different partitions for the same graph instance, if only the seed of the random number generator is altered. Similarly, a constructive graph
216
coloring algorithm is likely to yield a different coloring for a graph with permuted node ordering.
11.3
Forensic Engineering: The New Generic Approach
In this section, we introduce generic forensic engineering techniques that can be used to obtain fair rulings in copyright infringement cases. Forensic engineering aims at providing both qualitative and quantitative evidence of substantial similarity between the design original and its copy. The generic problem that a forensic engineering methodology tries to resolve can be formally defined as follows. Given a solution SP to a particular optimization problem instance P and a finite set of algorithms A applicable to P , the goal is to identify with a certain degree of confidence which algorithm Ai has been applied to P in order to obtain solution SP . An additional restriction is that the algorithms (their software or hardware implementations) have to be analyzed as black boxes. This requirement is based on two facts: ( i) similar algorithms can have different executables and ( ii) parties involved in the ruling are not eager to reveal their IP even in court.
Original problem instance P
Perturbations
Statistics Collection
Separate histogram χ (π ,A) for each property π and each algorithm A
Algorithm 1
Clustering of algorithms
Analysis
Algorithm 2 Isomorphic problem variants of P
Algorithm N
Solution provided for each problem instance P and algorithm A
Decision making
Figure 11.1: Global flow of the forensic engineering methodology.
217
The global flow of the generic forensic engineering approach is presented in Figure 11.1. It consists of three fully modular phases: Statistics collection. Initially, each algorithm Ai ∈ A is applied to a large number of isomorphic representations Pj , j = 1 . . . N of the original problem instance P . Note that “isomorphism” indicates pseudo-random perturbation of the original problem instance P . Then, for each obtained solution SPi j , i = 1 . . . |A|, j = 1 . . . M , an analysis program computes the values ωki,j , k = 1 . . . L for a particular set of solution’s properties πk , k = 1 . . . L. The reasoning behind performing iterative optimizations of perturbed problem instances is to obtain a valid statistical model on certain properties of solutions generated by a particular algorithm. Next, the collected statistical data (ωki,j ) is integrated into a separate histogram χik for each property πk under the application of a particular algorithm Ai . Since the probability distribution function for χik is in general not known, using non-parametric statistical methods [DeG89], each algorithm Ai is associated with probability pχik =X that its solution results in property πk being equal to X. Algorithm clustering. In order to associate an algorithm Ax ∈ A with the original solution SP , the set of algorithms is clustered according to the properties of SP . The value ωkSP for each property πk of SP is then compared to the collected histograms (χik , χjk ) of each pair of considered algorithms Ai and Aj . Two algorithms Ai , Aj remain in the same cluster, if the likelihood zAi ,Aj ,ωSP that their K
properties are not correlated is greater than some predetermined bound 1 ≤ 1 (K is the index of the property πK , which induces the highest anti-correspondence between the two algorithms). S
likelihood(πki =ωk P )
|π|
zAi ,Aj ,ωSP = maxk=1 K
S
S
likelihood(πki =ωk P )+likelihood(πkj =ωk P )
It is important to stress that a set of properties associated with algorithm Ai
218
can be correlated with more than one cluster of algorithms. For instance, this can happen when an algorithm Ai is a blend of two different heuristics (Aj , Ak ) and therefore its properties can be statistically similar to the properties of Aj , Ak . Obviously, in such cases exploration of different properties or more expensive and complex structural analysis of programs is the only solution. Decision making. This process is straightforward. If the plaintiff’s algorithm Ax is clustered jointly with the defendant’s algorithm Ay and Ay is not clustered with any other algorithm from A, substantial similarity between the two algorithms is positively detected at a degree quantified using the parameter zAx ,Ay ,ωSP . K
The court may adjoin to the experiment several slightly modified replicas of Ax as well as a number of strategically different algorithms from Ax in order to validate that the value of zAx ,Ay ,ωSP points to the correct conclusion. K
Obviously, the selection of properties plays an important role in the entire system. Two obvious candidates are the actual quality of solution and the runtime of the optimization program. Needless to say, such properties may be a decisive factor only in specific cases when copyright infringement has not occured. Only detailed analysis of solution structures can give useful forensic insights. In the remainder of this manuscript, we demonstrate how such analysis can be performed for graph coloring and boolean satisfiability.
11.4
Forensic Engineering: Statistics Collection
11.4.1
Graph Coloring
We present the developed forensic engineering methodology using the problem of graph K-colorability. In order to position the proposed approach, initially, we formalize the optimization problem and then survey a number of existing widely
219
accepted heuristics. Finally, we propose a set of heuristic properties that can be used to correlate individual graph coloring solutions to their algorithms. Since many resource assignment problems can be modeled using graph coloring, its applications in VLSI CAD are numerous (logic minimization, register assignment, cache line coloring, circuit testing, operations scheduling [Cou97]). The problem can be formally described using the following standard format: PROBLEM: GRAPH K-COLORABILITY INSTANCE: Graph G(V, E), positive integer K ≤ |V |. QUESTION: I s G K-colorable. i.e., does there exist a function f : V → 1, 2, 3, .., K such that f (u) = f (v) whenever u, v ∈ E?
In general, graph coloring is an NP-complete problem [Gar79]. Particular instances of the problem that can be solved in polynomial time are listed in [Gar79]. For instance, graphs with maximum vertex degree less than four, and bipartite graphs can be colored in polynomial time. Due to its applicability, a number of exact and heuristic algorithms for graph coloring has been developed to date. For brevity and due to limited source code availability, in this paper, we constrain our research to a few of them. The simplest constructive algorithm for graph coloring is the ”sequential” coloring algorithm (SEQ). SEQ sequentially traverses and colors vertices with the lowest index not used by the already colored neighboring vertices. DSATUR [Bre79] colors the next vertex with a color C selected depending on the number of neighbor vertices already connected to nodes colored with C (saturation degree) (Figure 11.2). RLF [Lei79] colors the vertices sequentially one color class at a time. Vertices colored with one color represent an independent subset (IS) of the graph. The algorithm tries to color with each color maximum number of vertices. Since the problem of finding the maximum IS is intractable [Gar79], a heuristic is em-
220
ployed to select a vertex to join the current IS as the one with the largest number of neighbors already connected to that IS. An example how RLF colors graphs is presented in Figure 11.3. Node 6 is randomly selected as the first node in the first IS. Two nodes (2,4) have maximum number of neighbors which are also neighbors to the current IS. The node with the maximum degree is chosen (4). Node 2 is the remaining vertex that can join the first IS. The second IS consists of randomly selected node 1 and the only remaining candidate to join the second IS, node 5. Finally, node 3 represents the last IS. max degree =4
max satur degree = 1 & max degree = 2
max satur degree = 1
max satur degree = 1
max satur degree = 1 & max degree = 3
max satur degree = 1 & max degree = 2
max satur degree = 2
lower order color
max satur degree = 2 & max degree = 3
Figure 11.2: Example of the DSATUR algorithm. Iterative improvement techniques try to, using various search techniques, find better colorings usually generating successive colorings by random moves. The most common search techniques are simulated annealing [Mor86, Joh91, Mor94] and tabu search [dWe85, Fle96]. In our experiments, we will constrain the pool of
221
algorithms A to a greedy, DSATUR, MAXIS (RLF based), backtrack DSATUR, iterated greedy, and tabu search (descriptions and source code at [Cul99]).
6
6
2
1 Neighbors
2
1 Neighbors
3 3
5
5
4
4 1
1 2
6
2
6
Neighbor
3
5
3 5 4
4
Figure 11.3: Example of the RLF algorithm. A succesful forensic technique should be able to, given a colored graph, distinguish whether a particular algorithm has been used to obtain the solution. The key to the efficiency of the forensic method is the selection of properties used to quantify algorithm-solution correlation. We propose a list of properties that aim at analyzing the structure of the solution: [π1 ] Color class size. Histogram of IS cardinalities is used to filter greedy algorithms that focus on coloring graphs constructively (e.g. RLF-like algorithms). Such algorithms tend to create large initial independent sets at the beginning of their coloring process. [π2 ] Number of edges in large independent sets. This property is used to aid
222
the accuracy of π1 by excluding easy-to-find large independent sets from consideration in the analysis. [π3 ] Number of edges that can switch color classes. This criteria analyzes the quality of the coloring. Good coloring result will have fewer nodes that are able to switch color classes. It also characterizes the greediness of an algorithm because greedy algorithms commonly create at the end of their coloring process many color classes that can absorb large portion of the remaining graph. [π4 ] Color saturation in neighborhoods. This property assumes creation of a histogram that counts for each vertex the number of adjacent nodes colored with one color. Greedy algorithms and algorithms that tend to sequentially traverse and color vertices are more likely to have node neighborhoods dominated by fewer colors. [π5 ] Sum of degrees of nodes included in the largest (smallest) color classes. This property aims at identifying algorithms that perform peephole optimizations, since they are not likely to create color classes with high-degree vertices. [π6 ] Sum of degrees of nodes adjacent to the vertices included in the largest (smallest) color classes. The analysis goal of this property is similar to π5 with the exception that it focuses on selecting algorithms that perform neighborhood lookahead techniques [Kir98gc]. [π7 ] Percent of maximal independent subsets. This property can be highly effective in distinguishing algorithms that color graphs by iterative color class selection (RLF). Supplemented with property π3 , it aims at detecting fine nuances among similar RLF-like algorithms.
223
The itemized properties can be effective only on large instances where the standard deviation of histogram values is relatively small. Using standard statistical approaches [DeG89], the function of standard deviation for each histogram can be used to determine the standard error incorporated in the reached conclusion. 4
2
3
8 6
1
1
8 4 6
5 3
7 5
7
DSATUR generated solution
2
RLF generated solution
Figure 11.4: Example of two different graph coloring solutions obtained by two algorithms, DSATUR and RLF. The index of each vertex specifies the order in which it is colored by the particular algorithm.

Although instances with small cardinalities cannot be a target of forensic methods, we use the graph instance in Figure 11.4 to illustrate how two different graph coloring algorithms tend to produce solutions characterized by different properties. The applied algorithms are DSATUR and RLF (described earlier in the section). Both algorithms color the graph constructively in the order denoted in the figure. If property π1 is considered, the solution created using DSATUR has a histogram χ_π1^DSATUR = {1_2, 2_3, 0_4}, where histogram value x_y denotes x color classes with cardinality y. Similarly, the solution created using RLF results in χ_π1^RLF = {2_2, 0_3, 1_4}. Commonly, extreme values point to the optimization goal of the algorithm or a characteristic structural property of its
solutions. In this case, RLF has found a maximum independent set of cardinality y = 4, a consequence of the algorithm's strategy to search in a greedy fashion for maximal ISs.
11.4.2 Boolean Satisfiability
We illustrate the key ideas of watermarking-based intellectual property protection techniques using the SAT problem. The SAT problem can be defined in the following way [Gar79]:

Problem: SATISFIABILITY (SAT)
Instance: A set of variables V and a collection C of clauses over V.
Question: Is there a truth assignment for V that satisfies all the clauses in C?

For instance, V = {v1, v2} and C = {{v1, v2}, {v1}, {v1, v2}} is an instance of SAT for which the answer is positive. A satisfying truth assignment is t(v1) = F and t(v2) = T. On the other hand, for the collection C = {{v1, v2}, {v1}} there is no satisfying solution. Boolean satisfiability is an NP-complete problem [Gar79]; it has been proven that every other problem in NP can be polynomially reduced to satisfiability [Coo71, Kar72]. SAT has an exceptionally wide application range, and many problems in CAD are modeled as SAT instances. For example, SAT techniques have been used in testing [Sil97, Ste96, Cha93, Kon93], logic synthesis, and physical design [Dev89]. There are at least three broad classes of solution strategies for the SAT problem: the first class of techniques is based on probabilistic search [Gu99, Sil99, Sel95, Dav60], the second consists of approximation techniques based on rounding the solution to a nonlinear program relaxation [Goe95], and the third is
a great variety of BDD-based techniques [Bry95]. For brevity, and due to limited source-code availability, we demonstrate our forensic engineering technology on the following SAT algorithms.

• GSAT identifies, for each variable v, the difference DIFF between the number of clauses currently unsatisfied that would become satisfied if the truth value of v were reversed and the number of clauses currently satisfied that would become unsatisfied if the truth value of v were flipped [Sel92, Sel93, Sel93a]. The algorithm pseudo-randomly flips assignments of variables with the greatest DIFF.

• WalkSAT selects with probability p a variable occurring in some unsatisfied clause and flips its truth assignment. Conversely, with probability 1 − p, the algorithm performs a greedy heuristic such as GSAT [Sel93a].

• NTAB performs a local search to determine weights for the clauses (intuitively, higher weights correspond to clauses that are harder to satisfy). The clause weights are then used to preferentially branch on variables that occur more often in clauses with higher weights [Cra96].

• Rel SAT rand represents an enhancement of GSAT with look-back techniques [Bay96].

In order to correlate a SAT solution with its corresponding algorithm, we have explored the following properties of the solution structure.

[π1] Percentage of non-important variables. A variable vi is non-important for a particular set of clauses C and satisfying truth assignment t(V) of all variables in V if both assignments t(vi) = T and t(vi) = F result in a satisfied C. For a given truth assignment t, we denote the subset of variables that
can switch their assignment without impact on the satisfiability of C as V_NI^t. In the remaining set of properties, only the functionally significant subset of variables V0 = V − V_NI^t is considered for further forensic analysis.

[π2] Clausal stability: the percentage of variables that can switch their assignment such that K% of the clauses in C are still satisfied. This property aims at identifying constructive greedy algorithms, since they assign values to variables such that as many clauses as possible are covered with each variable selection.

[π3] Ratio of true-assigned variables to the total number of variables in a clause. Although this property depends by and large on the structure of the problem, it aims at qualifying the effectiveness of the algorithm. Large values commonly indicate the use of algorithms that try to optimize the coverage achieved by each variable.

[π4] Ratio of coverage using positive and negative appearances of a variable. While property π3 analyzes the solution from the perspective of a single clause, this property analyzes the solution from the perspective of each variable. Each variable vi appears positively in pi clauses and negatively in ni clauses. The property quantifies the tendency of an algorithm to assign t(vi) = T when pi ≥ ni.

[π5] The GSAT heuristic. For each variable v, the difference DIFF = a − b is computed, where a is the number of clauses currently unsatisfied that would become satisfied if the truth value of v were reversed, and b is the number of clauses currently satisfied that would become unsatisfied if the truth value of v were flipped (a minimal sketch of this computation is given at the end of this subsection).

As in the case of graph coloring, the listed properties demonstrate significant
statistical proof only for large problem instances. Instances should be large enough to result in a low standard deviation of the collected statistical data. The standard deviation impacts the decision-making process according to the Central Limit Theorem [DeG89].
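The following minimal Python sketch illustrates properties π1 and π5 under an assumed representation: a clause is a list of signed integers (3 for v3, -3 for its complement) and an assignment maps variable indices to booleans. The helper names are hypothetical and the sketch is an illustration under these assumptions, not the implementation used in the experiments:

def clause_satisfied(clause, assignment):
    # A clause (list of signed ints) is satisfied if at least one literal is true.
    return any(assignment[abs(lit)] == (lit > 0) for lit in clause)

def diff_and_nonimportant(clauses, assignment):
    # For every variable, compute DIFF = a - b (property pi_5 / the GSAT score):
    #   a = clauses that become satisfied if the variable is flipped,
    #   b = clauses that become unsatisfied if the variable is flipped.
    # A variable is non-important (property pi_1) if flipping it leaves every
    # clause satisfied, assuming `assignment` already satisfies all clauses.
    diff, non_important = {}, set()
    for v in assignment:
        flipped = dict(assignment)
        flipped[v] = not flipped[v]
        a = sum(1 for c in clauses
                if not clause_satisfied(c, assignment) and clause_satisfied(c, flipped))
        b = sum(1 for c in clauses
                if clause_satisfied(c, assignment) and not clause_satisfied(c, flipped))
        diff[v] = a - b
        if all(clause_satisfied(c, flipped) for c in clauses):
            non_important.add(v)
    return diff, non_important

Aggregating the returned DIFF values and the fraction of non-important variables over many instances produces the kind of histograms used in the statistical analysis described above.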
11.5 Forensic Engineering: Algorithm Clustering and Decision Making
Once statistical data is collected, the algorithms in the initial pool are partitioned into clusters. The goal of partitioning is to join strategically similar algorithms (i.e. algorithms with similar properties) in a single cluster. This procedure is presented formally using the pseudo-code in Figure 11.5. The clustering process is initiated by setting the starting set of clusters to be empty, C = ∅. In order to associate an algorithm Ax ∈ A with the original solution SP, the set of algorithms is clustered according to the properties of SP. The value ω_k^{SP} of each property πk of SP is then compared to the collected histograms (χ_k^i, χ_k^j) of each pair of considered algorithms Ai and Aj. Two algorithms Ai and Aj remain in the same cluster if the likelihood z_{Ai,Aj,ω_K^{SP}} that their properties are
not correlated is greater than some predetermined bound ε1 ≤ 1 (K is the index of the property πK that induces extreme anti-correspondence between the two algorithms):

z_{Ai,Aj,ω_K^{SP}} = max_{k=1..|π|} likelihood(π_k^i = ω_k^{SP}) / (likelihood(π_k^i = ω_k^{SP}) + likelihood(π_k^j = ω_k^{SP}))
The function that computes the mutual correlation of two algorithms takes into account the fact that two properties can be mutually dependent. Algorithm Ai is added to a cluster Ck if its correlation with all algorithms in Ck is greater
than some predetermined bound ε1 ≤ 1. If Ai cannot be highly correlated with any algorithm from the existing clusters in C, then a new cluster C_{|C|+1} is created with Ai as its only member and added to C. If there exists a cluster Ck for which Ai is highly correlated with only a subset C_k^H of the algorithms within Ck, then Ck is partitioned into two new clusters C_k^H ∪ Ai and Ck − C_k^H. Finally, algorithm Ai is removed from the list of unprocessed algorithms A. These steps are iteratively repeated until all algorithms are processed.

Given A. C = ∅.
For each Ai ∈ A
    For each Ck ∈ C
        add = true; none = true
        For each Aj ∈ Ck
            If z_{Ai,Aj,ω_K^{SP}} ≥ ε1
                Then add = false
                Else none = false
        End For
        If add Then merge Ai with Ck
        Else create new cluster C_{|C|+1} with Ai as its only element.
        If none Then create two new clusters C_k^H ∪ Ai and Ck − C_k^H,
            where C_k^H ⊆ Ck is the subset of algorithms highly correlated with Ai.
    End For
End For
Figure 11.5: Forensic engineering: pseudo-code for the algorithm clustering procedure.

Obviously, according to this procedure, an algorithm Ai can be correlated with two different algorithms Aj, Ak that are not mutually correlated (as presented
in Figure 11.6). For instance, this situation can occur when an algorithm Ai is a blend of two different heuristics (Aj, Ak) and therefore its properties can be statistically similar to the properties of both Aj and Ak. In such cases, exploration of different properties, or a more expensive and complex structural analysis of the algorithm implementations, is the only way to detect copyright infringement.
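The clustering step of Figure 11.5 can be sketched in a few lines of Python. The sketch assumes a hypothetical pairwise score function z(a, b) (the anti-correlation likelihood defined above, computed from the collected histograms) and a threshold eps1; it mirrors the structure of the pseudo-code rather than any particular implementation:

def cluster_algorithms(algorithms, z, eps1):
    # Greedy clustering by pairwise anti-correlation score: a large z(a, b)
    # means the solution properties of a and b are NOT correlated.
    clusters = []
    for a in algorithms:
        placed = False
        for idx, cluster in enumerate(clusters):
            correlated = [b for b in cluster if z(a, b) < eps1]
            if len(correlated) == len(cluster):
                cluster.append(a)                  # correlated with every member: merge
                placed = True
                break
            if correlated:                         # correlated with only a subset: split
                rest = [b for b in cluster if b not in correlated]
                clusters[idx] = correlated + [a]
                clusters.append(rest)
                placed = True
                break
        if not placed:                             # no correlated cluster: new singleton
            clusters.append([a])
    return clusters

# Toy example with hypothetical scores: A1 and A3 are similar, A2 is not.
scores = {("A1", "A2"): 1e-1, ("A1", "A3"): 1e-5, ("A2", "A3"): 1e-1}
z = lambda a, b: scores.get((a, b), scores.get((b, a), 1.0))
print(cluster_algorithms(["A1", "A2", "A3"], z, eps1=1e-2))   # [['A1', 'A3'], ['A2']]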
Figure 11.6: Two different examples of clustering three distinct algorithms. The first clustering (figure on the left) recognizes substantial similarity between algorithms A1 and A3 and substantial dissimilarity of A2 with respect to A1 and A3. In the second clustering (figure on the right), algorithm A3 is recognized as similar to both A1 and A2, which are themselves found to be dissimilar.

Once the algorithms are clustered, the decision making process is straightforward:

• If the plaintiff's algorithm Ax is clustered jointly with the defendant's algorithm Ay,

• and Ay is not clustered with any other algorithm from A which has previously been determined to be strategically different,

• then substantial similarity between the two algorithms is positively detected, at a degree quantified using the parameter z_{Ax,Ay,ω_K^{SP}}.
The court may adjoin to the experiment several slightly modified replicas of Ax, as well as a number of algorithms strategically different from Ax, in order to validate that the value of z_{Ax,Ay,ω_K^{SP}} points to the correct conclusion.
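As a minimal sketch (assuming clusters in the list-of-lists form produced by the clustering sketch above and a set of algorithms already known to be strategically different from Ax), the decision rule reads:

def infringement_detected(clusters, Ax, Ay, known_different):
    # Report substantial similarity only if Ax and Ay share a cluster and that
    # cluster contains no algorithm previously determined to be strategically
    # different from Ax.
    for cluster in clusters:
        if Ax in cluster and Ay in cluster:
            return not any(b in known_different for b in cluster if b not in (Ax, Ay))
    return False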
11.6 Experimental Results
In order to demonstrate the effectiveness of the proposed forensic methodologies, we have conducted a set of experiments on both abstract and real-life problem instances. In this section, we present the obtained results for a large number of graph coloring and SAT instances. The collected data is partially presented in Figure 11.7, which spans eight pages at the end of this section. From top to bottom, the subfigures represent the following comparisons: (1,3) π1 for NTAB, Rel SAT, and WalkSAT, and (2,4) a zoomed version of the same property with only Rel SAT and WalkSAT (for two different sets of instances; first four subfigures in total), (5,6,7) π2 for NTAB, Rel SAT, and WalkSAT, and (8,9,10) π3 for NTAB, Rel SAT, and WalkSAT, respectively. The last five subfigures depict the histograms of property value distribution for the following pairs of algorithms and properties: (11) DSATUR with backtracking vs. maxis and π3, (12) DSATUR with backtracking vs. tabu search and π7, (13,14) iterative greedy vs. maxis and π1 and π4, and (15) maxis vs. tabu and π1. It is important to stress that, for the sake of external similarity among algorithms, we have adjusted the run-times of all algorithms such that their solutions are of approximately equal quality. We have focused our forensic exploration of graph coloring solutions on two sets of instances: random graphs (1000 nodes and 0.5 edge existence probability [Joh91]) and register allocation graphs. The last five subfigures in Figure 11.7 depict the histograms of property value distribution for the following pairs of algorithms
and properties: DSATUR with backtracking vs. maxis and π3, DSATUR with backtracking vs. tabu search and π7, iterative greedy vs. maxis and π1 and π4, and maxis vs. tabu and π1, respectively. Each of the diagrams can be used to associate a particular solution with one of the two algorithms A1 and A2 with 1% accuracy (100 instances attempted for statistics collection). For a given property value πi = x (X-dimension), a test instance can be associated with algorithm A1 with likelihood equal to the ratio of the Y-dimensions of the two histograms, A1(x)/A2(x). For the complete set of instances
and algorithms that we have explored, as can be observed from the diagrams, we have on average succeeded in associating 90% of the solution instances with their corresponding algorithms with probability greater than 0.95. According to the Central Limit Theorem [DeG89], in one half of the cases we have achieved an association likelihood better than 1 − 10^-6. The forensic analysis techniques that we have developed for solutions to SAT instances have been tested using a real-life (circuit testing) and an abstract benchmark set of instances adopted from [Kam93, Tsu93]. Parts of the collected statistics are presented in the first ten subfigures of Figure 11.7. The subfigures represent the following comparisons: π1 for NTAB, Rel SAT, and WalkSAT, then a zoomed version of the same property with only Rel SAT and WalkSAT (for two different sets of instances; first four subfigures in total), π2 for NTAB, Rel SAT, and WalkSAT, and π3 for NTAB, Rel SAT, and WalkSAT, respectively. The diagrams clearly indicate that solutions provided by NTAB can be easily distinguished from solutions provided by the other two algorithms using any of the three properties. However, solutions provided by Rel SAT and WalkSAT appear to be similar in structure (which is expected, because both use GSAT as the heuristic guidance for their propositional search). We were, however, able to
differentiate their solutions on a per-instance basis. For example, the second subfigure shows that solutions provided by Rel SAT have a much wider range of π1 values; therefore, approximately 50% of its solutions can easily be distinguished from WalkSAT's solutions with high probability. Significantly better results were obtained using another set of structurally different instances (zoomed comparison presented in the fourth subfigure), where among 100 solution instances no overlap in the value of property π1 was detected between Rel SAT and WalkSAT.
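The per-instance association rule used above can be illustrated as follows. Assuming the per-algorithm statistics have been binned into frequency histograms (hypothetical dictionaries, not the actual experimental data), the sketch normalizes the ratio A1(x)/A2(x) of histogram heights into a score between 0 and 1:

def associate(hist_a1, hist_a2, x):
    # Estimate how strongly a test solution with binned property value x points
    # to algorithm A1 rather than A2; 1.0 favors A1, 0.0 favors A2, 0.5 is neutral.
    f1, f2 = hist_a1.get(x, 0), hist_a2.get(x, 0)
    return 0.5 if f1 + f2 == 0 else f1 / (f1 + f2)

# Toy example with hypothetical bins: the value 0.52 was only ever observed for A1.
hist_a1 = {0.52: 40, 0.53: 10}
hist_a2 = {0.53: 15, 0.54: 35}
print(associate(hist_a1, hist_a2, 0.52))   # -> 1.0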
11.7 Conclusion
With the emergence of the Internet, intellectual property has become accessible and easily transferable. The improvements in product delivery and maintenance have a negative side-effect: copyright infringement has become one of the most commonly feared obstacles to IP e-commerce. We have proposed a forensic engineering technique that addresses the generic copyright infringement scenario: given a solution SP to a particular optimization problem instance P and a finite set of algorithms A applicable to P, the goal is to identify, with a certain degree of confidence, the algorithm Ai that has been applied to P in order to obtain SP. The application of the forensic analysis principles to graph coloring and Boolean satisfiability has demonstrated that solutions produced by strategically different algorithms can be associated with their corresponding algorithms with high accuracy.

Figure 11.7: Experimental results obtained for forensic engineering of graph coloring and SAT. The figure continues over the next 8 pages; a detailed explanation of each subfigure is given in the experimental results subsection.
[Figure 11.7 panels (frequency histograms; X-axis: Value, Y-axis: Frequency): percent_NIV for NTAB (blue), WALKSAT (red), and RELSATR (green), including zoomed views for two instance sets; clausal_stability for NTAB, RELSATR, and WALKSAT; clausal_truth_percent for NTAB and WALKSAT; bktdsat vs. maxis (larg_x_IS_stdev); bktdsat vs. tabu (percent_IS_max); itrgrdy vs. maxis (IS_size_stdev); itrgrdy vs. maxis (larg_x_IS_avg); maxis vs. tabu (IS_size_stdev).]
References [Adl96] A.-R. Adl-Tabatabai and T. Gross. Source-level debugging of scalar optimized code. SIGPLAN Notices, Vol.31, (no.5), pp.33-43, 1996. [Afc] Advanced Fibre Communications Inc. Private communication, 1999. [Aho77] A.V. Aho and J.D. Ullman. Principles of Compiler Design. Addison-Wesley, Reading, MA, 1977. [Aho83] A.V. Aho, J.E. Hopcroft, and J.D. Ullman. Data structures and algorithms. AddisonWesley, Reading, MA, 1983. [Apt] http://www.aptix.com. [Arm] http://www.arm.com. [Axi] http://www.axiscorp.com. [Bac94] D.F. Bacon et al. Compiler Transformations for High Performance Computing. ACM Computing Surveys, Vol. 26, (no.4), pp.345-420, 1994. [Bak98] B.S. Baker and U. Manber.
Deducing similarities in Java sources from bytecodes.
USENIX Technical Conference, pp.179-90, 1998. [Ban93] U. Banerjee et al. Automatic Program Parallelization. Proceedings of IEEE, Vol.81, (no.2), pp.211-243, 1993. [Bay96] R.J. Bayardo and R. Schrag. Using CSP look-back techniques to solve exceptionally hard SAT instances. Principles and Practice of Constraint Programming, pp.46-60, 1996. [Beh98] B.C. Behrens and R.R. Levary. Practical legal aspects of software reverse engineering. Communications of the ACM, Vol.41, (no.2), pp.27-9, 1998. [Ben96] W. Bender et al. Techniques for data hiding. IBM Systems Journal, Vol.35, (no.3-4), pp.313-336, 1996. [Ber95] A.A. Bertossi, M. Bonometto, and L.V. Mancini. Increasing processor utilization in hard-real-time systems with checkpoints. Real-Time Systems, Vol.9, (no.1), pp.5-29, 1995.
[Ber96] H. Berghel and L. O’Gorman. Protecting ownership rights through digital watermarking. Computer, Vol.29, (no.7), pp.101-103, 1996. [Bha93a] S. Bhatia and N.K. Jha. Synthesis of sequential circuits for easy testability through performance-oriented parallel partial scan. International Conference on Computer Design, pp.151-4, 1993. [Bha93b] S.S. Bhattacharyya and E.A. Lee. Scheduling synchronous dataflow graphs for efficient looping. Journal of VLSI Signal Processing, Vol.6, (no.3), pp.271-88, 1993. [Bon96] L. Boney, A.H. Tewfik and K.N. Hamdy. Digital watermarks for audio signals. International Conference on Multimedia Computing and Systems, pp.473-480, 1996. [Bor96] A.G. Bors and I. Pitas. Image watermarking using DCT domain constraints. International Conference on Image Processing, Vol.3, pp.231-344, 1996. [Bra84] R.K. Brayton et al. Logic Minimization Algorithms for VLSI Synthesis. Kluwer, Boston, MA, 1984. [Bra87] R.K. Brayton et al. MIS: a multiple-level logic optimization system. Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol.6, (no.6), pp.1062-81, 1987. [Bra94] D. Brand et al. Incremental synthesis. International Conference on Computer-Aided Design, pp.14-18, 1994. [Bra96] J. Brassil and L. O’Gorman. Watermarking document images with bounding box expansion. International Workshop on Information Hiding, pp.227-235, 1996. [Bre79] D. Brelaz. New methods to color the vertices of a graph. Communications of the ACM, Vol.22, (no.4), pp.251-6, 1979. [Bri95] S. Brin, J. Davis, and H. Garcia-Molina. Copy detection mechanisms for digital documents. SIGMOD Record, Vol.24, (no.2), pp.398-409, 1995. [Bro92] G. Brooks, G.J. Hansen, and S. Simmons. A new approach to debugging optimized code. SIGPLAN Notices, Vol.27, (no.7), pp.1-11, 1992. [Bry95] R.E. Bryant. Binary decision diagrams and beyond: enabling technologies for formal verification. International Conference on Computer-Aided Design, pp.236-243, 1995.
[Buc97] P. Buch et al. EC for power optimization using global sensitivity and synthesis flexibility. Low Power Electronics and Design, pp.88-91, 1997. [Cad] http://www.cadence.com. [Cha91] P.P. Chang, S.A. Mahlke, W.Y. Chen, N.J. Warter, and W.W. Hwu. IMPACT: an architectural framework for multiple-instruction-issue processors. Computer Architecture News, Vol.19, (no.3), pp.266-75, 1991. [Cha93] S.T. Chakradhar, V.D. Agrawal, and S.G. Rothweiler. A transitive closure algorithm for test generation. Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol.12, (no.7), pp.1015-28, 1993. [Cha97] S.-C. Chang et al. Postlayout logic restructuring using alternative wires. Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol.16, (no.6), pp.587-96, 1997. [Cha98] D. Chang, M.T.-C. Lee, K.-T. Cheng, and M. Marek-Sadowska. Functional scan chain testing. Design, Automation and Test in Europe, pp.278-83, 1998. [Cha99] E. Charbon and I. Torunoglu. Watermarking layout topologies. Asia and South Pacific Design Automation Conference, pp.213-16, 1999. [Che75] S-C. Chen and D.J. Kuck. Time and Parallel Processor Bounds for Linear Recurrence Systems. Transaction. on Computers, Vol.24, (no.7), pp.701-717, 1975. [Che89] K.-T. Cheng, V. Agrawal, D.D. Johnson, and T. Lin. A complete solution to the partial scan problem (IC testing). International Test Conference, pp.44-51, 1989. [Chi93] V. Chickermane, E.M. Rudnick, P. Banerjee, and J.H. Patel. Non-scan design-fortestability techniques for sequential circuits. Design Automation Conference, pp.236-41, 1993. [Coc70] J. Cocke and J.T. Schwartz. Programming Languages and Their Compilers: Preliminary Notes. New York: Courant Institute of Mathematical Science, 1970. [Coc79] J. Cocke, R.L. Malm, and J.J. Shedletsky. US Patent No.4306286. Logic simulation machine. Assignees: International Business Machines Corporation, issued 1981, filed 1979.
[Col98] A.J. Colmenarez and T.S. Huang. Pattern detection with information-based maximum discrimination and error bootstrapping. International Conference on Pattern Recognition, pp.222-4, 1998. [Col99] C.S. Collberg and C. Thomborson. Software Watermarking: Models and Dynamic Embeddings. Symposium on Principles of Programming Languages, 1999. [Con96a] J. Cong and Y. Ding. Combinational Logic Synthesis for LUT Based Field Programmable Gate Arrays. Trans. on Design Automation of Electronic Systems, Vol.1, (no.2), pp.145204, 1996. [Con96b] J. Cong and Y.-Y. Hwang. Simultaneous Depth and Area Minimization in LUT-based FPGA Mapping. 3rd International Symposium on FPGA, pp.68-74, 1995. [Cor90] T.H. Cormen, C.E. Leisserson, and R.L. Rivest. Introduction to Algorithms. MIT Press, Cambridge, MA, 1990. [Cor96] M.R. Corazao, M.A. Khalaf, L.M. Guerra, M. Potkonjak, and J. Rabaey. Performance optimization using template mapping for datapath-intensive high-level synthesis. Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol.15, (no.8), pp.877-888, 1996. [Cou88] D.S. Coutant, S. Meloy, and M. Ruscetta. DOC: a practical approach to source-level debugging of globally optimized code. SIGPLAN Notices, Vol.23, (no.7), pp.125-34, 1998. [Cou97] O. Coudert. Exact coloring of real-life graphs is easy. Design Automation Conference, pp.121-126, 1997. [Cox96] I.J. Cox et al. Secure spread spectrum watermarking for images, audio and video. International Conference on Image Processing, Vol.3, pp.243-246, 1996. [Cra93] J.M. Crawford. Solving Satisfiability Problems Using a Combination of Systematic and Local Search. Second DIMACS Challenge: Cliques, Coloring, and Satisfiability, 1993. [Cro75] R.E. Crochiere and A.V. Oppenheim. Analysis of Linear Networks. Proceedings of the IEEE, Vol.63, (no.4), pp.581-595, 1975. [Cul99] http://www.cs.ualberta.ca/ joe
[Dav60] M. Davis and H. Putnam. A Computing Procedure for Quantification Theory. Journal of the ACM, Vol.7, (no.3), pp.201-215, 1960. [DeG89] M. DeGroot. Probability and statistics. Reading, MA. Addison-Wesley, 1989. [DeM94] G. De Micheli. Synthesis and optimization of digital circuits. McGraw-Hill, New York, 1994. [Dev89] S. Devadas. Optimal layout via Boolean satisfiability. International Conference on Computer-Aided Design, pp.294-7, 1989. [Dey99] S. Dey, A. Raghunathan, N.K. Jha, and K. Wakabayashi. Controller-based power management for control-flow intensive designs. Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol.18, (no.10), pp.1496-508, 1999. [Dun92] P. Duncan et al. Hi-Pass: A Computer Aided Synthesis System for Fully Parallel Digital Signal Processing ASICs. ICASSP, pp.V-605-608, 1992. [dWe85] D. de Werra. An Introduction to Timetabling. European Journal of Operations Research, Vol.19, pp.151-162, 1985. [Edw97] S. Edwards, L. Lavagno, E.A. Lee, and A. Sangiovanni-Vincentelli. Design of embedded systems: formal models, validation, and synthesis. Proceedings of the IEEE, Vol.85, (no.3), pp.366-390, 1997. [EET99] http://eet.com/news/97/946news/evidence.html [Esc] http://www.escalade.com. [Fan97] W.-J. Fang et al. A real time RTL engineering change method supporting online debugging for logic emulation applications. Design automation Conference, pp.101-6, 1997. [Fis88] C.N. Fischer and R.J. Le Blank. Crafting a Compiler. The Benjamin/Cummings Publishing Co., Menlo Park, CA, 1988. [Fle96] C. Fleurent and J.A. Ferland. Genetic and hybrid algorithms for graph coloring. Annals of Operations Research, Vol.63, pp.437-461, 1996. [Gaj81] D.D. Gajski. An Algorithm for Solving Linear Recurrences Systems on Parallel and Pipelined Machines. Vol. 30, (no.3), pp.190-206, 1981.
[Gaj92a] D.D. Gajski et al. High-level synthesis: introduction to chip and system design. Kluwer, 1992. [Gaj92b] D.D. Gajski, N.D. Dutt, A.C.-H. Wu, and S.Y.-L. Lin. High-level synthesis: introduction to chip and system design. Kluwer Academic Publishers, Dordrecht, Netherlands, 1992. [Gar79] M.R. Garey and D.S. Johnson. Computers and intractability: a guide to the theory of NP-completeness. W. H. Freeman, San Francisco, CA, 1979. [Gat94] J. Gateley. Sun Microsystems Integrates Emulation into the SPARC Processor and Workstation Design Process. ASIC & EDA, July 1994. [GCW99] Gray Cary Ware & Freidenrich LLP. http://www.gcwf.com/firm/groups/tein/case.html [Gen90] D. Genin, P. Hilfinger, J. Rabaey, C. Scheers, and others. DSP specification using the Silage language. International Conference on Acoustics, Speech and Signal Processing, pp.1056-60, Vol.2, 1990. [Geu92] W. Geurts, F. Catthoor, and H. De Man. Time constrained allocation and assignment techniques for high throughput signal processing. Design Automation Conference, pp.124-127, 1992. [Goe42] M.X. Goemans and D.P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM, Vol.42, (no.6), pp.1115-45, 1995. [Gra73] A.H. Gray and J.D. Markel. Digital lattice and ladder filter synthesis. Transactions on Audio and Electroacoustics, Vol.21, (no.6), pp.491-500, 1973. [Gro98] D. Grover. Forensic copyright protection. Computer Law and Security Report, Vol.14, (no.2), pp.121-2, 1998. [Gu99] Jun Gu. Randomized and deterministic local search for SAT and scheduling problems. Randomization Methods in Algorithm Design, pp.61-108, 1999. [Gue93] L. Guerra, M. Potkonjak, and J. Rabaey. High Level Synthesis for Reconfigurable Datapath Structures. International Conference on Computer-Aided Design, pp.26-29, 1993. [Hac96] G.D. Hachtel and F. Somenzi. Logic synthesis and verification algorithms. Kluwer, Boston, 1996.
[Har89] R. Hartley et al. Tree-height minimization in pipelined architectures. International Conference on Computer-Aided Design, pp.112-115, 1989. [Har97] F. Hartung and B. Girod. Watermarking of MPEG-2 encoded video without decoding and re-encoding. Multimedia Computing and Networking, pp.264-274, 1997. [Has92] R. Hastings and B. Joyce. Purify: fast detection of memory leaks and access errors. USENIX, pp.125-136, 1992. [Hen82] J. Hennessy. Symbolic debugging of optimized code. Transactions on Programming Languages and Systems, Vol.4, (no.3), pp.323-44, 1982. [Hig93] H. Higuchi, K. Hamaguchi, and S. Yajima. Compact test sequences for scan-based sequential circuits. Fund. of Electronics, Communications and Computer Sciences, Vol.76, (no.10), pp.1676-83, 1993. [Hon97] I. Hong, D. Kirovski, and M. Potkonjak. Potential-Driven Statistical Ordering of Transformation. Design Automation Conference, pp.347-52, 1997. [Hon98] I. Hong and M. Potkonjak. IPP Techniques for Behavioral Specifications. Unpublished manuscript. 1998. [Hwa91] C.-T. Hwang, J.-H. Lee, and Y.-C. Hsu. A formal approach to the scheduling problem in high level synthesis. Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol.10, (no.4), pp.464-475, 1991. [Hyd93] R.A. Hyde and K. Glover. The application of scheduled H/inf inity controllers to a VSTOL aircraft. Transactions on Automatic Control, Vol.38, (no.7), pp.1021-39, 1993. [IBM] http://www.ibm.com. [IDG99] http://www.idg.net/new docids/development/veritys/verity/ lotus/notes/infringement/terminating/agreement/new docid 9-49638.html [Iko] http://www.ikos.com. [Int] http://www.interrainc.com/picasso.html. [IW99] http://informationweek.com/newsflash/nf644/0822 st6.htm [Joh91] D.S. Johnson et al. Optimization by simulated annealing: an experimental evaluation. Graph coloring and number partitioning. Operations Research, Vol.39, (no.3), pp.378406, 1989.
[Kah98] A.B. Kahng et al. Robust IP Watermarking Methodologies for Physical Design. Design Automation Conference, 1998. [Kam93] A.P. Kamath et al. An interior point approach to Boolean vector function synthesis. Midwestd Symposium on Circuits and Systems, pp.185-9, 1993. [Kar72] R.M. Karp. Reducability among combinatorial problems. Complexity of Computer Computations, Plenum Press, New York, pp.85-103, 1972. [Kel97] P. Keller and R. Eads. Integration requires SOS imagination. EETimes, issue 973, pp.102, September 29, 1997. [Kha96] S.P. Khatri et al. Engineering change in a non-deterministic FSM setting. Design Automation Conference, pp.451-6, 1996. [Kif95] A. Kifli, G. Goossens, H. De Man. A unified scheduling model for high-level synthesis and code generation. The European Design and Test Conference, pp.234-8, 1995. [Kir97a] D. Kirovski and M. Potkonjak. Quantitative Approach to Functional Debugging. International Conference on Computer-Aided Design, pp.170-5, 1997. [Kir97b] D. Kirovski and M. Potkonjak. System-level synthesis of low-power hard real-time systems. Design Automation Conference, pp.697-702, 1997. [Kir98a] D. Kirovski, M. Potkonjak, and L.M. Guerra. Functional debugging of systems-on-chip. International Conference on Computer-Aided Design, pp.525-8, 1998. [Kir98b] D. Kirovski, Y.-Y. Hwang, M. Potkonjak, and J. Cong. Intellectual Property Protection by Watermarking Combinational Logic Synthesis Solutions. International Conference on Computer-Aided Design, 1998. [Kir98c] D. Kirovski and M. Potkonjak. Efficient coloring of a large spectrum of graphs. Design and Automation Conference, pp.427-32, 1998. [Kir98d] D. Kirovski and M. Potkonjak. Intellectual Property Protection using Watermarking Partial Scan Chains for Sequential Logic Test Generation. High Level Design, Test and Verification, 1998. [Kir99] D. Kirovski, M. Potkonjak, and L.M. Guerra. Improving the Observability and Controllability of Datapaths for Emulation-based Debugging. Transactions on Computer Aided Design and Circuits and Systems, 1999.
[Koc95] G. Koch, U. Kebschull, and W. Rosenstiel. Debugging of behavioral VHDL specifications by source level emulation. European Design Automation Conference, pp.256-61, 1995. [Kon93] H. Konuk and T. Larrabee. Explorations of sequential ATPG using Boolean satisfiability. VLSI Test Symposium, pp.85-90, 1993. [Ku92] D.C. Ku and G. De Micheli. High Level Synthesis of ASICs under Timing and Synchronization Constraints. Kluwer, Dordrecht, Netherlands, 1992. [Kui94] H. Kuijsten. US Patent No.5680583. Method and apparatus for a trace buffer in an emulation system. Assignees: Arkos Design, issued 1997, filed 1994. [Kur87] F.J. Kurdahi and A.C. Parker. REAL: a program for REgister Allocation. Design Automation Conference, pp.210-215, 1987. [Lac98] J. Lach, W.H. Mangione-Smith, and M. Potkonjak. Fingerprinting Digital Circuits on Programmable Hardware. Workshop in Information Hiding, 1998. [Lee87a] E.A. Lee and D.G. Messerschmitt. Static Scheduling of Synchronous Data Flow Programs for Digital Signal Processing. Transactions on Computers, Vol.36, (no.1), pp.2435, 1987. [Lee87b] E.A. Lee and D.G. Messerschmitt. Synchronous data flow. Proceedings of the IEEE, Vol.75, (no.9), pp.1235-45, 1987. [Lee97] C. Lee, M. Potkonjak, and W.H. Mangione-Smith. MediaBench: A Tool for Evaluating and Synthesizing Multimedia and Communications Systems. IEEE Micro 30, 1997. [Lee98] C. Lee, J.S. Kin, M. Potkonjak, and W.H. Mangione-Smith. Media Architecture: General Purpose vs. Multiple Application Specific Programmable Processor. Design Automation Copnference, pp.321-6, 1998. [Lei79] F.T. Leighton. A Graph Coloring Algorithm for Large Scheduling Algorithms. Journal of Res. Natl. Bur. Standards, Vol.84, pp.489-506, 1979. [Lei91] C.E. Leiserson and J.B. Saxe. Retiming Synchronous Circuitry. Algorithmica, Vol.6, pp.5-35, 1991. [Li97] Chu Min Li and Anbulagan. Look-ahead versus look-back for satisfiability problems. Principles and Practice of Constraint Programming, pp.341-55, 1997.
[Lia97] S. Liao and S. Devadas. Solving Covering Problems using LPR-based Lower Bounds. Design Automation Conference, pp.117-120, 1997. [Lie97] C. Liem et al. System-on-a-chip cosimulation and compilation. IEEE Design & Test of Computers, Vol.14, (no.2), pp.16-25. 1997. [Liu92] D.L. Liu, J.-T. Li, T.B. Huang, and K.S.K. Choi. US Patent No.5425036. Method and apparatus for debugging reconfigurable emulation systems. Assignees: Quickturn Systems, issued 1995, filed 1992. [Liu97] T.H. Liu et al. Optimizing Designs Containing Black Boxes. Design Automation Conference, pp.113-116, 1997. [Lob91] D.A. Lobo and B.M. Pangrle. Redundant Operation Creation: A Scheduling Optimization Technique. Design Automation Conference, pp.775-778, 1991. [LP] ftp://ftp.es.ele.tue.nl/pub/lp solve [Lsi] http://www.lsilogic.com. [Mad89] J.C. Madre, O. Coudert, and J.P. Billon. Automating the diagnosis and the rectification of design errors with PRIAM. International Conference on Computer-Aided Design, pp.30-3, 1989. [Mak97a] S.R. Makar and E.J. McCluskey. Iddq test pattern generation for scan chain latches and flip-flops. IEEE International Workshop on IDDQ Testing, pp.2-6, 1997. [Mak97b] S.R. Makar and E.J. McCluskey. ATPG for scan chain latches and flip-flops. VLSI Test Symposium, pp.364-9, 1997. [Man97] S.T. Mangelsdorf et al. Functional verification of the HP PA 8000 processor. HewlettPackard Journal, August 1997. [Mar98] J. Marantz. Enhanced Visibility and Performance in Functional Verification by Reconstruction. Design Automation Conference, pp.164-9, 1998. [Mau86] C. Maunder. JTAG, the Joint Test Action Group. IEE Colloquium on New Ideas in Testing, pp.6/1-4. 1986. [McG95] D.F. McGahn. Copyright infringement of protected computer software: an analytical method to determine substantial similarity. Rutgers Computer & Technology Law Journal, Vol.21, (no.1), pp.88-142, 1995.
[McK65] W.M. McKeeman. Peephole optimization. Communications of the ACM, Vol.8, (no.7), pp.443-444, 1965. [Men] http://www.mentorg.com. [Men97] A.J. Menezes, P.C. van Oorschot, and S.A. Vanstone. Handbook of applied cryptography. Boca Raton, CRC Press, 1997. [Mic] http://www.microsoft.com/mcis. [Mil88] G.L.Miller et al. Efficient Parallel Evaluation of Straight-Line Code and Arithmetic Circuits. SIAM Journal on Computing, Vol.17, (no.4), pp.687-695, 1988. [Mip] http://www.mips.com. [Mor86] C. Morgenstern and H. Shapiro. Chromatic Number Approximation Using Simulated Annealing. Unpublished, 1986. [Mor94] C. Morgenstern. Distributed Coloration Neighborhood Search. DIMACS Series in Discrete Mathematics, Vol.0, 1994. [Mor95] Y. Morley. US Patent No.5751982. Software emulation system with dynamic translation of emulated instructions for increased processing speed. Assignees: Apple Computer, issued 1998, filed 1995. [Mot] Motorola. Private Communication, 1999. [Nor96] R.B. Norwood and E.J. McCluskey. Synthesis-for-scan and scan chain ordering. VLSI Test Symposium, pp.87-92, 1996. [Not91] S. Note, W. Geurts, F. Catthoor, and H. De Man. Cathedral-III: architecture-driven high-level synthesis for high throughput DSP applications. Design Automation Conference, pp.597-602, 1991. [Oli99] A.L. Oliviera. Robust Techniques For Watermarking Sequential Circuit Designs. Design Automation Conference, pp.837-42, 1999. [Par95] K.K. Parhi. High-Level Algorithm and Architecture Transformations for DSP Synthesis. Journal of VLSI Signal Processing, Vol.9, (no.1-2), pp.121-143, 1995. [Pat95] C. Patel. US Patent No.5546562. Method and apparatus to emulate VLSI circuits within a logic simulator. Assignees: none, issued 1996, filed 1995.
[Pau89] P.G. Paulin and J.P. Knight. Force-directed scheduling for the behavioral synthesis of ASICs. Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol.8, (no.6), pp.661–679, 1989. [Por85] M. Poret and J. McKinley. US Patent No.4674089. In-circuit emulator. Assignees: Intel Corporation, issued 1987, filed 1985. [Pot91] M. Potkonjak. Algorithms for High Level Synthesis Resource Utilization Based Approach, Ph.D. Thesis, University of California at Berkeley, 1991. [Pot92] M. Potkonjak and J. Rabaey. Maximally fast and arbitrarily fast implementation of linear computations (circuit layout CAD). IEEE/ACM International Conference on ComputerAided Design, pp.304-8, 1992. [Pot95] M. Potkonjak, S. Dey, and K. Wakabayashi. Design-for-Debugging of Application Specific Designs. International Conference on Computer-Aided Design, pp.295-301, 1995. [Pow94] G.S. Powley and J.E. DeGroat. Experiences in testing and debugging the i960 MX VHDL model. Proceedings of VHDL International Users Forum, pp.130-5, 1994. [Qu98] G. Qu and M. Potkonjak. Analysis of watermarking techniques for graph coloring problem. International Conference on Computer-Aided Design, 1998. [Qui] http://www.quickturn.com. [Rab91] J. Rabaey, C. Chu, P. Hoang, and M. Potkonjak. Fast Prototyping of Datapath-Intensive Architectures. Design and Test of Computers, Vol.8, (no.2), pp.40-51, 1991. [Rao92] D.S. Rao and F.J. Kurdahi. Partitioning by regularity extraction. Design Automation Conference, pp.235–238, 1992. [Ros95] M. Rosenblum, S.A. Herrod, E. Witchel, and A.Gupta. Complete computer system simulation: the SimOS approach. IEEE Parallel and Distributed Technology: Systems and Applications, Vol.3, (no.4), pp.34-43, 1995. [Sam88] S.P. Sample, M.R. D’Amour, and T.S. Payne. US Patent No.5109353. Apparatus for emulation of electronic hardware system. Assignees: Quickturn Systems, issued 1992, filed 1988. [Sel92] B. Selman, H.J. Levesque, and D. Mitchell. A New Method for Solving Hard Satisfiability Problems. National Conference on Artificial Intelligence, 1992.
[Sel93] B. Selman and H. Kautz. Domain-Independent Extensions to GSAT: Solving Large Structured Satisfiability Problems. International Conference on Artificial Intelligence, 1993. [Sel93a] B. Selman, H. Kautz, and B. Cohen. Local Search Strategies for Satisfiability Testing. Cliques, Coloring, and Satisfiability: Second DIMACS Implementation Challenge, 1993. [Sel95] B. Selman. Stochastic search and phase transitions: AI meets physics. IJCAI, pp.9981002, Vol.1, 1995. [Set70] R. Sethi and J.D. Ullman. The generation of optimal code for arithmetic expressions. Journal of the ACM, Vol.17, (no.4), pp.715-728, 1970. [Sha95] G.A. Shaw, J.C. Anderson, and V.K. Madisetti. Assessing and improving current practice in the design of application-specific signal processors. International Conference on Acoustics, Speech, and Signal Processing, pp.2707-10, 1995. [Sha96] M. Sarrafzadeh. An introduction to VLSI physical design. New York, McGraw Hill, 1996. [Sil97] J.P.M. Silva and K.A. Sakallah. Robust search algorithms for test pattern generation. International Symposium on Fault-Tolerant Computing, pp.152-61, 1997. [Sil99] J.P. Marques-Silva and K.A. Sakallah. GRASP: a search algorithm for propositional satisfiability. Transactions on Computers, Vol.48, (no.5), pp.506-21, 1999. [Smi91] M.D. Smith. Tracing with pixie. Technical Report CSL-TR-91-497, Stanford University, November 1991. [Spa95] G.A. Spanos and T.B. Maples. Performance study of a selective encryption scheme for the security of networked, real-time video. International Conference on Computer Communications and Networks, pp.2-10, 1995. [Ste96] P. Stephan et al. Combinational test generation using satisfiability. Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol.15, (no.9), pp.1167-76, 1996. [Ste97] O. Steinmann et al. Tabu Search vs. random walk. Annual German Conference on Artificial Intelligence, pp.337-48, 1997.
[Sto89] L. Stok and R. van den Born. EASY: multiprocessor architecture optimisation. Logic and Arch. Synthesis for Silicon Compilers, pp.313-328, 1989. [Swa97] G. Swamy et al. Minimal logic re-synthesis for engineering change. International Symposium on Computers and Systems, Vol.3. pp.1596-9, 1997. [Swo93] G.L. Swoboda, M.D. Daniels, and J.A. Coomes. US Patent No.5329471. Emulation devices, systems and methods utilizing state machines. Assignees: Texas Instruments, issued 1994, filed 1993. [Syn] http://www.synopsys.com. [Syl99] D. Sylvester and K. Keutzer. Rethinking deep-submicron circuit design. Computer, Vol.32, (no.11), pp.25-33, 1999. [Tae] http://www.taeus.com. [Ten] http://www.tensilica.com. [Tho68] P.R. Thornton. Scanning Electron Microscopy. Chapman and Hall, 1968. [Ti] http://www.ti.com/sc/docs/dsps/tools/c5000/c54x/spry012.pdf [Tiw96] V. Tiwari et al. Technology mapping for low power in logic synthesis. Integration. Vol.20, (no.3), 1996. [Tri87] H. Trickey. Flamel: A High-Level Hardware Compiler. IEEE Transaction on CAD, Vol.6, (no.2), pp.259-269, 1987. [Tsa98] J. Tsai, S.-Y. Kuo, and Y.-M. Wang. Theoretical analysis for communication-induced checkpointing protocols with rollback-dependency trackability. Transactions on Parallel and Distributed Systems, Vol.9, (no.10), pp.963-71, 1998. [Tsu93] Y. Tsuji and A. Van Gelder. Incomplete thoughts about incomplete satisfiability procedures. Proceedings of the 2nd DIMACS Challenge, 1993. [Tuc97] B. Tuck. Integrating IP blocks to create a system-on-a-chip. Computer Design, Vol.36, (no.11), pp.49-62, 1997. [Uch93] K. Uchiyama et al. The Gmicro/500 superscalar microprocessor with branch buffers. IEEE Micro, Vol.13, (no.5), pp.12-22, 1993. [VSI] VSI Alliance. Fall Worldwide Member Meeting: A Year Of Achievement. October 1997.
[Wan97] Y.-M. Wang. Consistent global checkpoints that contain a given set of local checkpoints. Transactions on Computers, Vol.46, (no.4), pp.456-68, 1997. [Wat91] Y. Watanabe and R.K. Brayton. Incremental synthesis for EC. International Conference on Computer-Aided Design, pp.40-3, 1991. [Wol98] G. Wolfe et al. Watermarking Techniques for Intellectual Property Protection. Design Automation Conference, 1998. [Yan94] H. Yang and D.F. Wong. Efficient network flow based min-cut balanced partitioning. International Conference on Computer-Aided Design. pp.50-5, 1994. [Yan95] L. Yang et al. System design methodology of UltraSPARC-I. Design Automation Conference, pp.7-12, 1995. [Yu96] A. Yu. The future of microprocessors. IEEE Micro, Vol.16, (no.6), pp.46-53, 1996. [Zer] http://www.0-in.com [Ziv96] V. Zivojnovic and H. Meyr. Compiled HW/SW co-simulation. Design Automation Conference, pp.690-695, 1996. [Ziv98] A. Ziv and J. Bruck. Analysis of checkpointing schemes with task duplication. Transactions on Computers, Vol.47, (no.2), pp.222-7, 1998.