Design and Formal Verification of Image Processing Circuitry by

Farzad Khalvati

A Ph.D research proposal presented to the University of Waterloo for the Ph.D comprehensive examination in Electrical and Computer Engineering

Supervisor: Professor Mark Aagaard

Waterloo, Ontario, Canada, 2005

© Farzad Khalvati 2005

Abstract

Our research proposal targets an innovative method for the design and formal verification of image processing circuitry. The proposed method presents a performance improvement while preserving correctness. To date, most hardware formal verification research has focused on microprocessors, which does not directly address the challenges of image processing hardware design. Our proposal takes advantage of the progress accomplished in microprocessor design and verification to improve image processing circuitry. We propose to benefit from a characteristic of image data, value locality, to reuse or predict the result of an operation without performing it. To verify the correctness of the design, we propose to extend a verification technique that combines equivalence checking and completion functions, to make it applicable to the verification of image processing circuitry.

From the design perspective, throughput is an important measure of performance for data intensive circuits. Our goal is to increase throughput with minimal increase in area and latency. To that end, we propose to exploit advanced data flow techniques, proposed in modern processor design, to decrease the number of calculations performed on an image. These techniques take advantage of the value locality of instruction operands to produce the result of an instruction without executing it, by either reusing the previous results of the instruction (instruction reuse) or predicting the result of the instruction (value prediction). Consequently, the number of instructions executed per machine cycle increases. Although the two techniques demonstrate good results in theory, they have not yet been used in real designs, since the probability of instructions having repeated operands is not very high. Image processing algorithms can be categorized into two classes: recursive and non-recursive. For non-recursive algorithms, we present a modified version of instruction reuse (result reuse) that reuses the results of operations previously performed. In a preliminary study of result reuse, we achieved speedups of up to 1.82 with an average precision of 79% and extra hardware of 15%. Recursive algorithms require feedback paths that often make it difficult to increase the clock speed by conventional pipeline optimization techniques. To overcome this performance limitation, we propose to extend value prediction in such a way that it is applicable to recursive image algorithms (result prediction). Value prediction, which introduces speculative execution by predicting the result of an instruction before executing it, is well suited for image algorithms with feedback paths. Our preliminary studies show that a simple 2-bit saturation machine, used as a result prediction mechanism, predicts with 78% accuracy for typical image data used in our edge detection case study. To apply the result reuse and result prediction techniques to image circuitry, we propose to implement the design as a 2-wide or 3-wide superscalar pipeline, rather than as a scalar pipeline. Classic superscalar pipelines require twice or three times the hardware needed by scalar pipelines. In contrast, our proposed superscalar pipelines will only need extra hardware to implement the reuse and prediction mechanisms (including a reorder buffer and a memoization or speculation unit), rather than replicating the original hardware, as is the case in classic superscalar pipelines.
We anticipate that it will be possible to implement our proposed superscalar pipelines with significantly less hardware than that required for classic superscalar pipelines. Our studies show that using a 3-wide superscalar pipeline, the operations can be performed up to three times faster than when using a scalar pipeline, with significantly less area than a classic 3-wide pipeline design. It is expected that for images with various levels of value locality, different performance and precision results will be obtained. We propose a design that automatically adjusts the precision so as to provide a constant rate of throughput. Developing a guideline for performance improvement of different classes of image algorithms, using the result reuse and result prediction techniques, will be a final design goal.

From the verification viewpoint, introducing result reuse and result prediction to image processing hardware will lead to additional pipeline hazards. To tackle this challenge, we propose to extend a method that exploits the capabilities of combinational equivalence verification in combination with the composable verification strategy of completion functions. This technique, which we have developed in a group effort, automatically verifies that pipeline hazards are handled in microprocessor implementations. To use this verification technique for image processing hardware, we will extend it to handle the pipeline hazards that are unique to image processing circuitry. Extending and applying the above mentioned method (completion functions) to a domain other than microprocessors will allow us to explore and evaluate how a formal method introduced in the microprocessor world can be applied to a new field. As a result, developing a formal verification methodology for image processing circuitry will be a final verification goal.

Contents

1 Introduction

2 Design Optimization Techniques for Image Processing Circuitry
  2.1 Background on Instruction Reuse and Value Prediction
  2.2 Image “Detailedness”
  2.3 Result Reuse: High level simulation
    2.3.1 Memory Hit Rate Versus Results Precision
    2.3.2 High Level Model Simulation I: Reuse Buffer Size
    2.3.3 High Level Model Simulation II: Gray-Levels
  2.4 Result Reuse: Hardware Implementation
    2.4.1 Classic Implementation of a Case Study: Sobel Edge Detector
    2.4.2 Result Reuse: Behavioral Model of Hardware
    2.4.3 Hardware Area Estimation for a Generic Model of Spatial Domain Algorithms
    2.4.4 Hardware Area Estimation for the Result Reuse Technique
    2.4.5 Behavioral Level Implementation of Sobel Edge Detector Applying the Proposed Result Reuse Technique
    2.4.6 Behavioral Simulation with Small Reuse Buffer
  2.5 Result Prediction: Preliminary Test
  2.6 Future Work

3 Formal Verification of Pipelined Circuitry
  3.1 Introduction to Formal Verification
    3.1.1 Formal Verification of Hardware Systems
    3.1.2 Successes and Challenges of Formal Verification
  3.2 Background on Formal Verification
    3.2.1 Model Checking
    3.2.2 Equivalence Checking
    3.2.3 Theorem Proving
  3.3 Related Work
  3.4 Combining Completion Functions with Equivalence Checking
    3.4.1 Introduction
    3.4.2 Background on Completion Functions
    3.4.3 Approach
    3.4.4 Case Study: Sobel Edge Detector
  3.5 Future Work

4 Conclusion

Bibliography

List of Tables

2.1 A 3 × 3 mask
2.2 FPGA cells required for different operations
2.3 Speedup, precision and area for class Low with a 32 unit reuse buffer
2.4 Speedup, precision and area for class Medium with a 32 unit reuse buffer
2.5 Speedup, precision and area for class High with a 32 unit reuse buffer
2.6 Speedup, precision and area for class Low with a 4 unit reuse buffer
2.7 Speedup, precision and area for class Medium with a 4 unit reuse buffer
2.8 Speedup, precision and area for class High with a 4 unit reuse buffer
4.1 Proposal Scheduling

List of Figures

1.1 Histogram of a 256 × 256 image
1.2 32 × 32 randomly selected neighborhood histograms of the cameraman image
2.1 A 6-stage scalar pipeline for microprocessors
2.2 Instruction reuse for microprocessors [2]
2.3 Value prediction for microprocessors [2]
2.4 Detailedness algorithm results for two extreme cases
2.5 Class Low images with average detailedness of 56%
2.6 Class Medium images with average detailedness of 72%
2.7 Class High images with average detailedness of 82%
2.8 Flowchart of the result reuse high level model
2.9 Result reuse and precision rate versus reuse buffer size for three different classes of images
2.10 Result reuse and precision rate versus number of gray levels for three different classes of images
2.11 Sobel block diagram
2.12 Input data table
2.13 Sobel equations
2.14 A scalar pipeline for result reuse
2.15 Scalar and superscalar pipelines
2.16 A superscalar pipeline for result reuse
2.17 Result reuse and precision rate for a 32 unit reuse buffer for the class Low images
2.18 Result reuse and precision rate for a 32 unit reuse buffer for the class Medium images
2.19 Result reuse and precision rate for a 32 unit reuse buffer for the class High images
2.20 Result reuse and precision rate for a 4 unit reuse buffer for the class Low images
2.21 Result reuse and precision rate for a 4 unit reuse buffer for the class Medium images
2.22 Result reuse and precision rate for a 4 unit reuse buffer for the class High images
2.23 Design space for result reuse
2.24 Feedback path in hardware
2.25 Result prediction
2.26 2-bit saturation machine
3.1 Commutative diagram for Burch-Dill approach
3.2 Simple pipeline with flushing and completion functions commuting diagram
3.3 Third step of simple example

Chapter 1

Introduction

Vision is the most important means by which sighted people collect information about their environment. Over 90% of the information that a sighted person receives from her/his surroundings comes through the vision system. Computer vision, which strives to mimic the human vision system, is in its early stages based on image processing algorithms. Digital image processing algorithms are used in a broad range of applications, such as medical image analysis, space photography, quality control, navigation, security and multimedia. Many of these applications are used in real time. Image processing is generally a data intensive task, and thus performing an image processing algorithm in real time is a challenging job. On the other hand, for many real time applications it is very important to process the image data as fast as possible. For example, vision-based vehicle navigation systems, which are responsible for road environment recognition (obstacle detection and road side detection), require processing the incoming images fast enough so that the system can navigate the car based on real information [48]. Otherwise, slow processing of the image data will lead to lower navigation accuracy. The data intensive nature of image processing, combined with the performance requirements of real time applications, makes it both crucial and challenging to optimize the performance of hardware implementations. Our research proposal focuses on two main research questions:

• How can we improve the performance of hardware implementations of image processing algorithms? and, at the same time,

• How can we ensure the correctness of the image processing circuitry?

During our preliminary research to answer these questions, we have come across the following observations:

• Image Data Locality
• Human Peripheral Vision
• Modern Processor Techniques

The first observation is image data locality, a characteristic of image data that indicates the probability of having identical pixels (pixels with the same gray intensity) in an image. Many natural images contain a lot of repeated pixels that are distributed across the image (global redundancy). This can be observed through the histograms of images (Figure 1.1). Interestingly, exploring the local histograms of images shows that the redundancy of image data is also local. This means that if we randomly choose small neighborhoods of pixels from an image, many pixels of the randomly selected neighborhood will still be redundant (Figure 1.2).

Figure 1.1: Histogram of a 256 × 256 image. (a) Cameraman; (b) Histogram

Figure 1.2: 32 × 32 randomly selected neighborhood histograms of the cameraman image

The second observation is the fact that the human eye does not fully process the entire information of a scene. It reduces the amount of data to be processed by selecting a small area of the scene on which we focus. This technique is called selective data reduction [23]. When we are looking at a scene, as long as there is no change in the peripheral part of the sight, the brain will only process the image information that comes from the small part on which we focus. Although we did not use this idea directly in our approach, it inspired us to explore the possibility of reducing the amount of data to be processed by image processing circuitry.

Finally, the third observation is the set of advanced data flow techniques proposed for modern microprocessors. Generally, the performance of microprocessors is limited by two program characteristics: the control flow limit and the data flow limit. The control flow limit is caused by control hazards and arises from the speculative execution of instructions, such as branches, that change the flow of the program. The data flow limit, which is caused by data hazards, is due to the unavailability of the operands of a particular instruction. We are interested in techniques that aim to break the data flow limit. Instruction reuse and value prediction are two techniques that have been proposed in modern microprocessor design to overcome the data flow limit. As we will discuss in more detail in section 2.1, these two techniques take advantage of the value locality of instructions and their operands to produce the result of an instruction without executing it, by either reusing the previous results of the instruction (instruction reuse) or predicting the result of the instruction (value prediction). As a result, the number of instructions executed per machine cycle increases and hence the overall performance improves.

To answer our first research question, and in light of the observations above, we came up with the following research hypothesis: the number of operations performed on an image can be reduced, and hence the performance of image processing circuitry can be improved, by either:

• Reusing the results of operations previously performed on the image (suitable for non-recursive algorithms)
• Predicting the results of operations (suitable for recursive algorithms)

To explore our hypothesis for design optimization of image processing circuitry, we have performed different simulations at various levels, which will be discussed in chapter 2. We have chosen the Sobel edge detector, which is a frequently used technique in image processing (e.g. motion detection, segmentation, image enhancement and target recognition), as a case study for our simulations. The results for our proposed design optimization techniques show a great potential for improving the performance of image processing circuitry using these two techniques.

Our second research question deals with the formal verification of image processing circuitry. Traditional simulation tries to validate the design by comparing the simulation results with the expected results of a high level pattern, which reflects the initial intention of the designer. Formal verification, on the other hand, provides a mathematical method for both defining the high level description (specification) and verifying the design (implementation) against the specification in a formal way.


To address our second research question, which concerns the verification of image processing circuitry, we have, in a group effort, combined the completion functions approach with the combinational equivalence checking technique. Completion functions decompose the pipeline verification task into smaller modules. The initial goal was to verify a pipelined design at the register-transfer level. We have applied the developed method to a case study (the Sobel edge detector) to explore the possibility of developing a formal verification method, based on the combination of completion functions and equivalence verification, for verifying image processing circuits.

Our research proposal targets two innovative optimization techniques (result reuse and result prediction) for the design of real time image processing. These techniques take advantage of the progress accomplished in microprocessor design to improve image processing circuitry, benefiting from a characteristic of image data called value locality. Our proposal also includes a formal verification technique, which is an extension of a verification technique that combines equivalence checking and completion functions. Extending this verification technique makes it possible to formally verify pipelined image processing circuitry at the register-transfer level.

The outline for the rest of the proposal report is as follows. In chapter 2, we will introduce our two proposed techniques for design optimization of image processing circuitry and will present the simulation results. Chapter 2 also covers background information on the instruction reuse and value prediction techniques, as well as an algorithm called “detailedness” that classifies the input images based on their complexity. Chapter 3 will cover background information on formal verification and our approach to the formal verification of image processing circuitry. Finally, we will present the conclusion and scheduling for our research proposal in chapter 4.

Chapter 2

Design Optimization Techniques for Image Processing Circuitry

Image processing algorithms fall into two major categories: spatial domain and frequency domain algorithms. Spatial domain algorithms deal with image pixels directly, while frequency domain algorithms work with the result of the Fourier transform of images. Transforming images from the spatial domain into the frequency domain is a powerful tool for developing different image processing algorithms, especially filters. Nevertheless, to implement frequency domain algorithms in hardware, it is often cheaper to first transform the algorithms back into the spatial domain [37].

Spatial domain algorithms are based on convolution, which involves numerous multiplications and additions. To date, different hardware architectures have been proposed in order to achieve better hardware performance. Generally, to implement the convolution algorithm efficiently, multipliers are transformed into simpler units (shift-and-add). The number of shift-and-adds required to perform a convolution can be reduced by applying an optimization algorithm proposed by Eun and Sunwoo [41]. A few algorithms have been proposed to arrange the order in which the additions take place to improve the overall performance of the operations [13]. It is also known that for certain pattern recognition applications, it is possible to reduce the amount of input image data (using edge detection or thinning algorithms) to attain high speed in processing [33].


For robotic navigation systems, selective reduction of image information, inspired by the human eye, has been proposed and applied by Boluda et al. [16].

As we will discuss in detail in section 2.1, instruction reuse and value prediction are two techniques that exploit the locality of instructions and their operands to improve the performance of microprocessors. We have combined these two techniques with the repetitive nature of image data to develop two optimization techniques (result reuse and result prediction) for pipelined image processing circuitry. Spatial domain algorithms can be categorized into non-recursive and recursive classes of algorithms. The result reuse technique, which is applicable to non-recursive algorithms, reuses the results of operations previously performed on an image to decrease the number of operations required to perform a convolution task and hence to improve the performance of the hardware implementation. Result prediction, which is applicable to recursive algorithms, predicts the results of operations before performing the calculations. Recursive algorithms require feedback paths in their hardware implementations, which increase latency and thereby neutralize the performance gain. As will be explained in section 2.5, using result prediction, pipelining a circuit with a feedback path helps improve the performance. The result reuse technique improves the overall performance of image processing circuitry with minimal penalty in area and precision. Similarly, the result prediction technique improves the performance of image processing circuitry with minimal penalty in area and no penalty in precision.

The outline for the rest of this chapter is as follows. In section 2.1, we present background information about the instruction reuse and value prediction techniques in modern processor design. In section 2.2, we introduce an algorithm that calculates the detailedness of images. We will use this algorithm to categorize the input images into three different classes of complexity for our case studies. Sections 2.3 and 2.4 present the high level simulation of result reuse implemented in Matlab and the hardware implementation of this technique using VHDL, respectively. In section 2.5, we explain how we exploit the value prediction technique to develop a technique that improves the performance of circuits that contain feedback paths, and finally section 2.6 presents the future work.


2.1 Background on Instruction Reuse and Value Prediction

This section provides background information on the instruction reuse and value prediction techniques. Despite the innovations made in modern microprocessor design, performance is essentially limited by two program characteristics: the control flow limit and the data flow limit. The control flow limit of a program can cause control hazards, which arise from the speculative execution of instructions. For example, instructions such as branches that change the flow of the program can cause control hazards if the program flow has been mispredicted. Although many techniques have been proposed and implemented in modern processor design to overcome the performance limitations caused by the control flow limit, this issue continues to be an open and important research field in modern processor design [20]. The data flow limit, on the other hand, is caused by data hazards, which are due to unhandled data dependences between consecutive instructions. There are three different classes of data dependences in microprocessor design:

• Anti-dependence or Write After Read (WAR) dependence
• Output dependence or Write After Write (WAW) dependence
• True dependence or Read After Write (RAW) dependence

Among the three data dependences, anti-dependence and output dependence, which are called false dependences, are caused by reusing registers that have already been written to. False data dependences can be resolved by dynamically renaming the destination operand to a unique location. Techniques that remove false data dependences efficiently have been implemented in processor designs for the last four decades. Nevertheless, due to the limited number of storage locations in real designs, false dependences are inevitable [20]. On the other hand, true data dependence, which determines the data flow limit in microprocessors, is the critical path between a consumer instruction and its source operands. In other words, the data flow limit is reached when every instruction is executed as soon as its operands are available. The goal of the instruction reuse and value prediction techniques is to break the data flow limit and hence to increase performance.


They introduce additional instruction-level parallelism by reusing or predicting the result of an instruction without executing it. The two techniques take advantage of a property of the data fed into the in-flight instructions, which is called value locality. Value locality is the probability of having similar instruction operands in a program flow and, consequently, the likelihood of having similar instruction results. Value locality is caused by many factors, but among them data redundancy is a significant one. Data redundancy is due to the fact that many programs use data that have little differences.

Figure 2.1: A 6-stage scalar pipeline for microprocessors (Fetch, Decode, Issue, Register File, Execute, Commit)

Figure 2.1 shows a scalar pipeline with 6 stages. Assume that two consecutive instructions enter the pipeline:

• α : R3 ← R1 + R2
• β : R4 ← R3 + R2

There is a Read-After-Write dependence between instructions α and β, meaning that when instruction β enters the Decode stage, instruction α has not yet executed, since it still sits in the Issue stage. Instruction β has to wait two machine cycles, sitting in the Decode stage, to let instruction α go through the Execute stage and produce the result for register R3. As soon as R3 is updated, instruction β can go ahead and read the updated value of R3.


This means that as long as there is a true data dependence (RAW) between instructions, a scalar pipeline has to stall frequently and, as a result, it will never reach its ultimate throughput, which is one instruction per machine cycle. The idea behind instruction reuse and value prediction is to use the value locality of instructions and operands to produce, as soon as possible, the result of an instruction on which subsequent instructions have true data dependences, and hence to reduce the number of cycles that the pipeline has to stall. Reducing the number of pipeline stalls increases throughput toward one instruction per machine cycle (the ultimate throughput of a scalar pipeline).

Figure 2.2: Instruction reuse for microprocessors [2]

Figure 2.2 shows how instruction reuse is implemented in a scalar pipeline for microprocessors. To reuse an instruction, it is necessary to verify that its future result is going to be the same as the previous one, in which case the result can be reused. To implement the instruction reuse scheme in hardware, a memoization mechanism is required. The hardware implementation of the memoization mechanism is a memory, which is called the reuse buffer (RB). Three schemes have been proposed to control the reuse buffer [1]. The first scheme checks the operand values of each instruction to verify whether they are present in the reuse buffer. The second scheme tracks the operand names, and the third one follows the data dependences among the instructions.


By reusing the results of instructions, the processor can reduce data dependences, skip the execution of instructions with redundant operands, and hence increase performance [20]. Since the reuse buffer access can be pipelined, regardless of its size, it is unlikely to be part of the critical path of the design [1].
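To make the reuse-buffer mechanism concrete, the following Python sketch implements the first of the three control schemes above (lookup by operand values). It is an illustration only, not the hardware of [1]; the buffer size, the FIFO replacement policy and the stand-in execute function are assumptions of the sketch.

    from collections import OrderedDict

    class ReuseBuffer:
        """Toy reuse buffer: maps (opcode, operand values) to a previously
        computed result, with FIFO eviction when the buffer is full."""
        def __init__(self, size=16):
            self.size = size
            self.entries = OrderedDict()

        def lookup(self, key):
            return self.entries.get(key)          # None on a miss

        def update(self, key, result):
            if len(self.entries) >= self.size:
                self.entries.popitem(last=False)  # evict the oldest entry
            self.entries[key] = result

    def execute(op, a, b):
        return a + b if op == "add" else a - b    # stand-in for the execute stage

    def run(instructions, rb):
        reused = 0
        for op, a, b in instructions:
            key = (op, a, b)
            result = rb.lookup(key)
            if result is None:                    # miss: execute and memoize
                result = execute(op, a, b)
                rb.update(key, result)
            else:                                 # hit: skip the execute stage
                reused += 1
        return reused

    # Repeated operand values make later instances reusable.
    stream = [("add", 3, 4), ("sub", 9, 2), ("add", 3, 4), ("add", 3, 4)]
    print(run(stream, ReuseBuffer(size=4)))       # prints 2

The same lookup-before-execute structure reappears in our result reuse technique in section 2.3, with 3 × 3 pixel tables in place of instruction operands.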

Figure 2.3: Value prediction for microprocessors [2]

Figure 2.3 depicts the implementation of value prediction for a scalar pipeline [2]. Value prediction exploits the value locality of instructions and operands through speculation. As soon as an instruction is fetched, its result is predicted based on its previous results. Therefore, the instructions that follow the first instruction and need its result can keep going through the pipeline using the predicted result. However, the speculatively executed instructions will not be committed until it is confirmed that the prediction was correct. If a misprediction has occurred, the speculative instructions are killed and re-executed using the correct result of the previous instruction. The value prediction unit, which is responsible for predicting the results of instructions, consists of two tables [20]: the classification table (CT) and the value prediction table (VPT). The classification table keeps track of the frequency of correct predictions for each instruction and determines whether an instruction should be predicted. The value prediction table provides the value that must be used as the predicted result of an instruction.


In case of misprediction, the value prediction table is updated with the actual result of the instruction. Both the instruction reuse and value prediction techniques aim to break the data flow limit by providing instruction results as soon as the instructions are fetched. Although high level simulations show good results for the two techniques, neither of them has yet been implemented in real designs. The reason is that implementing these techniques requires significant modifications of the existing control and data paths in microprocessor architectures, and designers in industry are not yet convinced that the price they have to pay for design modifications will be returned by the gain in performance [20]. Moreover, for the instruction reuse technique, processors require 100% precision for instruction results, which restricts reuse to only those instructions that have identical operands.
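As an illustration of the CT/VPT organization described above, the Python sketch below models a last-value predictor for a single static instruction. The use of a 2-bit saturating counter as the classification entry and the last-value prediction policy are assumptions made for the sketch, not details taken from [20].

    class ValuePredictor:
        """Toy last-value predictor for one static instruction: a 2-bit
        saturating counter (the CT entry) decides whether to predict, and the
        VPT entry holds the value to predict."""
        def __init__(self):
            self.counter = 0        # 0..3; predict only when counter >= 2
            self.last_value = None  # VPT entry

        def predict(self):
            if self.counter >= 2 and self.last_value is not None:
                return self.last_value
            return None             # not confident enough to speculate

        def verify(self, actual):
            correct = (self.last_value == actual)
            # saturating update of the confidence counter
            self.counter = min(3, self.counter + 1) if correct else max(0, self.counter - 1)
            self.last_value = actual  # VPT updated with the actual result
            return correct

    # A mostly constant result trains the predictor quickly.
    vp = ValuePredictor()
    for actual in [5, 5, 5, 7, 5, 5]:
        guess = vp.predict()
        ok = vp.verify(actual)
        print(guess, actual, "match" if ok else "mismatch")

A similar saturating-counter mechanism is the basis of the 2-bit saturation machine we use for result prediction later in this proposal.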

2.2 Image “Detailedness”

The most basic element of an image is the pixel. For images with 256 gray levels, each pixel, which can be identified as an 8 bit number between 0 and 255, represents the smallest particle of an image. An image is a set of individual pixels with certain relationships between their gray levels. For 256 × 256 pixels, we can imagine numerous images made by various combinations of the original pixels. This means that, in theory, every image processing algorithm has to deal with a huge number of different images. In general, image processing algorithms are not universal. Performance improvement techniques in particular, which deal with the relationships between pixel gray levels, have varying effects on different images. The reason for the different improvement results is the broad range of pixel gray level combinations in images. To analyze design improvement techniques in real time image processing, it is necessary to classify the input images based on their complexity. This classification will help us explore a proposed technique for different levels of complexity of the images used as input sets. In addition, it may help us customize a performance improvement technique for a certain class of images.


We present an algorithm that calculates how complicated (or detailed) an image is. The algorithm generates a number between 0 and 100, which indicates the detailedness of the input image. For a given image, the algorithm moves along each row and column and calculates the difference between every two pixels that are located five pixels apart. Each difference is compared to a threshold, and the number of differences greater than the threshold is counted for each row and column. Afterward, the median of these counts is calculated and divided by the median of the three largest counts. The results for rows and columns are normalized, and the average of these two results is the number that indicates the detailedness of the input image. The distance (5 pixels) and the threshold were obtained by experiment. Nevertheless, they are not final numbers and may be changed in order to tune the algorithm. The following describes the algorithm.

1. input an image I
2. for each row x: for each column y:
   diff = abs(I(x, y) − I(x, y + 5)); if diff >= threshold, counter1(x)++
3. med_row = the median of the calculated counts (counter1) for the rows
4. for each column y: for each row x:
   diff = abs(I(x, y) − I(x + 5, y)); if diff >= threshold, counter2(y)++
5. med_col = the median of the calculated counts (counter2) for the columns
6. medmax_row = the median of the three largest values of counter1
7. medmax_col = the median of the three largest values of counter2

8. result_row = (med_row / medmax_row) × 100
9. result_col = (med_col / medmax_col) × 100
10. detailedness = (result_row + result_col) / 2
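For reference, the steps above can be written compactly in Python (the proposal's high level models are written in Matlab). The threshold value below is only an assumed example, since the description fixes the 5 pixel distance but not the threshold, and the guard against division by zero is an addition of the sketch.

    import numpy as np

    def detailedness(image, distance=5, threshold=20):
        """Return a 0-100 score for how detailed an image is (steps 1-10 above).
        `image` is a 2-D array of gray levels; `threshold` is an assumed value."""
        img = np.asarray(image, dtype=int)

        # Steps 2-3: per-row counts of large differences between pixels 5 apart.
        row_diffs = np.abs(img[:, :-distance] - img[:, distance:])
        counter1 = np.sum(row_diffs >= threshold, axis=1)
        med_row = np.median(counter1)

        # Steps 4-5: the same along columns.
        col_diffs = np.abs(img[:-distance, :] - img[distance:, :])
        counter2 = np.sum(col_diffs >= threshold, axis=0)
        med_col = np.median(counter2)

        # Steps 6-7: medians of the three largest counts (guarded against zero).
        medmax_row = max(np.median(np.sort(counter1)[-3:]), 1)
        medmax_col = max(np.median(np.sort(counter2)[-3:]), 1)

        # Steps 8-10: normalize and average.
        result_row = med_row / medmax_row * 100
        result_col = med_col / medmax_col * 100
        return (result_row + result_col) / 2

A flat image scores near 0 and a highly textured image scores near 100, in line with the two extreme cases of Figure 2.4.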

Figure 2.4 shows the results generated by applying the algorithm to two extreme cases. As seen in the figure, the algorithm generated 0% and 95% as the detailedness indicators for the very simple and the very complicated image, respectively. Using the detailedness algorithm, we are able to classify images based on the level of variation of their pixel gray levels. We will use this classification in analyzing our proposed performance improvement techniques in the upcoming sections.

Figure 2.4: Detailedness algorithm results for two extreme cases. (a) A simple image: detailedness = 0%; (b) a complicated image: detailedness = 95%

To categorize the input images for our simulations and as a starting point for our proposal, we chose nine different images and ran the detailedness algorithm. The results for detailedness were between 43% and 84%. We categorized the images into three classes, each containing three images: Class “Low” with average detailedness of 56%, class “Medium” with average detailedness of 72% and finally class “High” with average detailedness of 82%. Figures 2.5, 2.6 and 2.7 show the images for the three different detailedness classes.


Figure 2.5: Class Low images with average detailedness of 56%

Figure 2.6: Class Medium images with average detailedness of 72%

Figure 2.7: Class High images with average detailedness of 82%

2.3 Result Reuse: High level simulation

As discussed earlier, image data is naturally redundant. The locality of image data can be used to decrease the number of repeated calculations. We investigate result reuse for image processing algorithms that work in the spatial domain. The main idea behind result reuse is the following: if a certain set of calculations has been performed on a pixel neighborhood, called a table, can we reuse the results of these calculations for future tables that are identical to the previous one, to reduce the total number of calculations for an image? For each image processing algorithm, matching tables will produce identical results. Therefore, if we encounter a table that is the same as one on which we have already performed the necessary calculations and generated the results, we can skip the calculations for the new table and reuse the results produced for the previous matching table. In other words, a greater number of redundant tables will result in a greater rate of reuse of previously calculated results, and thus the total number of calculations will be reduced.

Similar to the instruction reuse technique in processors, result reuse requires a reuse buffer that stores the tables and their results after the calculations have been performed. The pixels of each new table will be compared to the pixels stored in the reuse buffer. If all the pixels of the new table match the pixels of a table stored in the reuse buffer, the result will be retrieved from the buffer and the calculations for the new table will be skipped. A larger reuse memory will result in a higher memory hit rate and consequently a larger number of reused results. In the remainder of this section, we describe the high level model of result reuse in more detail. Afterward, in section 2.4, the hardware implementation of result reuse at the behavioral level will be presented.

2.3.1 Memory Hit Rate Versus Results Precision

We express the memory (reuse buffer) hit rate, which is equal to the result reuse rate, as the percentage of cases in which the new table values match one of the tables stored in the reuse buffer. One way to increase the reuse buffer hit rate is to increase the size of the memory. However, increasing the memory size consumes more hardware and requires more complicated control circuitry. On the other hand, although image data usually has repeating pixel gray levels, the probability of having redundant tables of 9 pixels is not very high in many images. The reason is the high number of gray levels (i.e. 256), which decreases the probability that a few particular pixels with a specific spatial relationship among them (e.g. the nine pixels of a 3 × 3 table) are repeated many times in an image.


To increase the probability of having redundant tables in an image, we reduced the number of gray levels of the image. As a result, the number of repeating tables increased significantly and, accordingly, the reuse buffer hit rate increased as well. However, reducing the gray levels reduced the precision of the results, depending on the number of gray levels used. It should be noted that precision is not identical to accuracy. We define accuracy as the correctness of the hardware implementation of an image processing algorithm. Precision, in contrast, is a measure that shows how far the output deviates from the golden result because of altering the number of gray levels to improve the performance. Defining a mathematical description for precision is left as future work. Reducing the number of gray levels is not a novel method in image processing. Binary thresholding is in fact a way to reduce the gray scale down to two levels. Furthermore, the method that we have used is multi-thresholding, which has also been used in a different way in a fuzzy inference model proposed for image segmentation in [47]. If we reduce the gray levels in such a way that the relationships between neighboring pixels are preserved, the precision obtained for any spatial domain algorithm will be reasonably high. To that end, we divide each pixel gray level by a power of 2 (between 1 and 128) and then round the obtained gray level to the closest integer. Consequently, we use different numbers of gray levels between 2 and 256 for our simulations.
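As a small illustration of this gray-level reduction, assuming 8-bit input pixels (the clipping of the top value after rounding is an added detail of the sketch, not part of the description above):

    import numpy as np

    def reduce_gray_levels(image, levels):
        """Map 8-bit pixels (0-255) onto `levels` gray levels, where `levels`
        is a power of two between 2 and 256, by dividing by 256/levels and
        rounding to the nearest integer."""
        step = 256 // levels                       # a power of 2 between 1 and 128
        reduced = np.rint(np.asarray(image) / step).astype(int)
        return np.clip(reduced, 0, levels - 1)     # keep indices in range after rounding

    # With 16 gray levels, each pixel is divided by 16 and rounded.
    print(reduce_gray_levels([[0, 7, 9, 255]], levels=16))   # -> [[0 0 1 15]]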

2.3.2 High Level Model Simulation I: Reuse Buffer Size

We have developed a high level model of result reuse in Matlab and explored how various reuse buffer sizes and different numbers of pixel gray levels affect the result reuse rate. As a case study, we have applied our approach to the Sobel edge detector algorithm, using 3 × 3 tables. The reuse buffer must be capable of storing 3 × 3 tables and their results, which amount to 10 numbers per entry. As we will explain later, each number can be stored as a 1 to 8 bit number, depending on the number of gray levels used (between 2 and 256). For simplicity, we call the reuse buffer block that can store these 10 numbers one memory unit.


We have performed high level simulations for different sizes of the reuse buffer, ranging from 4 units up to 1024 units.

Figure 2.8: Flowchart of the result reuse high level model

For the gray levels, we used different numbers of levels: 2, 4, 8, 16, 32, 64, 128 and 256. For measuring result precision, we used the Hamming distance between the result generated by the result reuse technique and the one produced by the classic Sobel algorithm. Figure 2.8 shows the flowchart of the result reuse algorithm that we implemented in Matlab for the high level simulation. As Figure 2.8 shows, each new table is compared to the tables stored in the reuse buffer. If the new table hits the buffer, the result will be reused from the buffer and the operations will be skipped. Otherwise, the operations will be applied to the table and the reuse buffer will be updated with the produced result.

We ran the high level simulation for different reuse buffer sizes with the gray levels reduced from 256 down to 16. As Figure 2.9.a shows, the images of class Low have a very high average reuse rate (45%) even for a very small reuse buffer of 4 units. Increasing the reuse buffer size up to 1024 units increases the result reuse rate up to 71%. The average precision of the results, which depends only on the number of gray levels chosen, is 97% for the class Low images with 16 gray levels.
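A condensed Python version of the Figure 2.8 flow is sketched below; the proposal's model is written in Matlab. The FIFO replacement policy for the reuse buffer and the generic compute callback are assumptions of the sketch.

    from collections import OrderedDict
    import numpy as np

    def result_reuse_model(image, compute, buffer_units=32):
        """Slide a 3x3 table over `image`, reusing results of identical tables.
        Returns (output image, reuse rate). `compute` maps a 3x3 table to a pixel."""
        img = np.asarray(image)
        rows, cols = img.shape
        rb = OrderedDict()                       # reuse buffer: table bytes -> result
        out = np.zeros((rows - 2, cols - 2), dtype=int)
        hits = 0
        total = (rows - 2) * (cols - 2)

        for r in range(rows - 2):
            for c in range(cols - 2):
                table = img[r:r + 3, c:c + 3]
                key = table.tobytes()
                if key in rb:                    # hit: look the result up, skip the operations
                    out[r, c] = rb[key]
                    hits += 1
                else:                            # miss: perform the calculations, update the RB
                    result = compute(table)
                    if len(rb) >= buffer_units:
                        rb.popitem(last=False)   # FIFO eviction (policy assumed)
                    rb[key] = result
                    out[r, c] = result
        return out, hits / total

Driving this model with the Sobel computation of Figure 2.13 as the compute callback, and with the gray-level reduction of section 2.3.1 as preprocessing, allows the reuse rate to be explored as a function of buffer size and number of gray levels.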

Figure 2.9: Result reuse and precision rate versus reuse buffer size for three different classes of images. (a) Class Low, with an average precision of 97% for 16 gray levels; (b) class Medium, with an average precision of 95% for 16 gray levels; (c) class High, with an average precision of 93% for 16 gray levels

It is interesting that for the class Low images, even a small reuse buffer (e.g. 4 units) can lead to a high rate of result reuse (45%) with very high precision (97%). In applications such as automatic vehicle navigation and quality assurance, the incoming data consists of images that have consistent backgrounds and hence are less detailed. Moreover, sequential images of a scene can contain much redundant data, which can lead to a very high rate of result reuse with a high percentage of precision. Another example would be medical images, which usually contain consistent backgrounds.


On the other hand, medical imaging is typically computationally very expensive. Applying the result reuse technique to medical applications would increase the speed of the real time systems, which could be crucial for medical purposes.

We performed the same simulation for the images of class Medium and class High. Although the class Medium images have an average detailedness of 72%, the 4 unit reuse buffer led to 13% result reuse, and a 36% reuse rate was obtained with a 1024 unit reuse memory (Figure 2.9.b). For different buffer sizes, the precision was almost constant (95%). As predicted, the class High images demonstrate the lowest result reuse rate among the three classes. As Figure 2.9.c shows, the reuse rate is as low as 5% for a 4 unit reuse buffer and 15% for a 1024 unit reuse memory, with a constant precision of 93%. This is due to the complexity of the images in this class (average detailedness of 82%). However, many applications, such as quality assurance, vehicle navigation and medical imaging, are unlikely to contain such detailed and complicated images that decrease the rate of result reuse.

2.3.3 High Level Model Simulation II: Gray-Levels

We performed another high level simulation to explore the effect of changing the number of gray levels on the result reuse rate and precision. For this simulation, we used a 512 unit reuse buffer. Reducing the number of gray levels increased the result reuse rate dramatically. Although reducing the number of gray levels down to 2 decreases the precision of the results for class Low, Medium and High down to 88%, 78% and 72% respectively, it increases the average reuse rate of all three classes up to 99.72%, 99.61% and 99.61% respectively (Figure 2.10.a-c). On the other hand, to use 2 gray levels, we do not need a reuse buffer as big as 512 units. As we will see later, with a small number of gray levels (e.g. 2 or 4) and a much smaller reuse buffer (e.g. 4 units), a high percentage of result reuse is obtained. As observed in Figures 2.10.a-c, for any reuse buffer size we can find the optimal point where a high rate of result reuse can be obtained with reasonably high precision.

Design Optimization Techniques for Image Processing Circuitry 100

Figure 2.10: Result reuse and precision rate versus number of gray levels for three different classes of images, using a 512 unit reuse buffer. (a) Class Low; (b) class Medium; (c) class High

For a 512 unit reuse buffer, the optimal points for all three image classes are obtained by setting the number of gray levels to 4, which gives us:

• Class Low: precision = 87%, result reuse = 95%
• Class Medium: precision = 81%, result reuse = 86%
• Class High: precision = 78%, result reuse = 76%

The high level simulations show that for a reasonable amount of error in the results, we can achieve high result reuse rates. Depending on the application and the amount of precision required, a system that uses the proposed technique can adjust the number of gray levels to speed up the computations using an appropriate size of reuse buffer.


From the high level simulations, it is also found that the result precision is independent of the reuse buffer size; it depends only on the number of gray levels used for the simulation.

2.4 Result Reuse: Hardware Implementation

In the previous section, we presented the high level simulation of the result reuse technique. Although the simulation results show a very high percentage of result reuse with reasonable precision, we need to apply the method in a hardware implementation in such a way that it improves the overall performance. In this section, we first present the classic hardware implementation of a case study (the Sobel edge detector), which we have implemented in VHDL (section 2.4.1). Afterward, we explain how we implement result reuse in hardware in order to optimize the performance of image processing circuitry (section 2.4.2). Although we estimate that spatial domain algorithms are generally much more complicated than the Sobel algorithm and that their hardware implementations would consume much more area than the Sobel hardware (section 2.4.3), we use the area of the Sobel hardware as a reference for calculating the ratio of extra hardware that our approach needs (section 2.4.4). Finally, in section 2.4.5, we apply the result reuse technique to the Sobel edge detector circuitry to generate simulation results. In order to calculate the speedup gained by the result reuse technique, we use as a reference the minimum number of clock cycles required by a classic Sobel circuit to perform edge detection on a 256 × 256 image, which is 256 × 256 = 65536 cycles.

2.4.1 Classic Implementation of a Case Study: Sobel Edge Detector

The input to the hardware implementation of the Sobel edge detector is a 256 × 256 gray-level image, which is sent to the circuit serially (one pixel at each clock cycle, horizontally) and buffered in a 3 × 256 intermediate buffer. When the third element of the third row is written into the intermediate buffer, 9 pixels from the first three columns are read and stored in a 3 × 3 table. From this point on, at each clock cycle a new table is generated and the calculations are performed on it; these include calculating the derivative, absolute value, maximum and magnitude (Figures 2.11, 2.12 and 2.13).


Our implementation contains about 700 FPGA cells and operates at a maximum clock frequency of 134 MHz on a Xilinx Virtex II Pro FPGA.

Figure 2.11: Sobel block diagram (8-bit pixel input, write and read address FSMs, a 3 × 256 byte memory, the 3 × 3 table, and the derivative, absolute, maximum and magnitude units producing the edge output)

Figure 2.12: Input data table (a 3 × 3 table with rows a1 a2 a3, b1 b2 b3 and c1 c2 c3)

Figure 2.13: Sobel equations

Derivative:
  Horiz:      Hf = (a1 + 2a2 + a3) − (c1 + 2c2 + c3)
  Vert:       Vf = (a3 + 2b3 + c3) − (a1 + 2b1 + c1)
  Diag left:  Lf = (a2 + 2a3 + b3) − (b1 + 2c1 + c2)
  Diag right: Rf = (a2 + 2a1 + b1) − (b3 + 2c3 + c2)
Absolute:     Ha = |Hf|, Va = |Vf|, La = |Lf|, Ra = |Rf|
Maximum:      Max = max(Ha, Va, La, Ra)
Magnitude:    Maxp = absolute value of the derivative perpendicular to Max
              Mag = Max + Maxp/8
              out = if (Mag > thresh) then 1 else 0
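For reference, the calculations of Figure 2.13 for a single 3 × 3 table can be restated in a few lines of Python. This is only a software restatement of the equations, not the VHDL circuit; the default threshold value, the tie-breaking order when two directional derivatives share the maximum, and the use of integer division for the divide-by-8 are assumptions of the sketch.

    def sobel_edge(table, thresh=127):
        """Evaluate the Figure 2.13 equations on one 3x3 table
        [[a1, a2, a3], [b1, b2, b3], [c1, c2, c3]]; returns 0 or 1."""
        (a1, a2, a3), (b1, b2, b3), (c1, c2, c3) = table

        # Directional derivatives
        hf = (a1 + 2 * a2 + a3) - (c1 + 2 * c2 + c3)   # horizontal
        vf = (a3 + 2 * b3 + c3) - (a1 + 2 * b1 + c1)   # vertical
        lf = (a2 + 2 * a3 + b3) - (b1 + 2 * c1 + c2)   # diagonal left
        rf = (a2 + 2 * a1 + b1) - (b3 + 2 * c3 + c2)   # diagonal right

        # Absolute values and the dominant direction
        ha, va, la, ra = abs(hf), abs(vf), abs(lf), abs(rf)
        mx = max(ha, va, la, ra)

        # Absolute value of the derivative perpendicular to the maximum
        # (horizontal <-> vertical, diagonal left <-> diagonal right)
        if mx == ha:
            mxp = va
        elif mx == va:
            mxp = ha
        elif mx == la:
            mxp = ra
        else:
            mxp = la

        mag = mx + mxp // 8
        return 1 if mag > thresh else 0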

2.4.2 Result Reuse: Behavioral Model of Hardware

For stream-processing tasks such as image processing, throughput is a performance measure that depends on the clock frequency and the number of clock cycles required to process a quantity of data. Since we have implemented our proposed technique at the behavioral level, we only consider the number of clock cycles required to process an image for performance comparison purposes. For scalar microprocessors, because of data dependences between instructions, throughput is always less than 1 (Figure 2.1). Instruction reuse in microprocessors leads to a smaller number of stalls caused by true data dependences.


The ultimate goal of the instruction reuse technique is to increase throughput and make it as close as possible to 1 [2]. In contrast, there are no data dependences between operations in an image processing pipeline, and the throughput of a scalar pipeline for an image processing circuit is 1, which is the maximum throughput of a scalar pipeline. Hence, there is no room for performance optimization in scalar pipelines of image processing implementations. Even though the pipe stage for result reuse will have less latency than the stage for the regular calculations, the overall performance of a scalar pipeline will stay the same.

Figure 2.14: A scalar pipeline for result reuse (stages A and B, then either stage P, which reuses the result, or stage Q, which performs the calculations, followed by stage C)

Figure 2.14 shows a classic scalar pipeline that implements result reuse. Assume that stage P has less latency than stage Q, and suppose that we have added some control circuitry so that the outputs always exit the pipeline in the correct order. Because the total number of clock cycles for processing an image is usually very high (e.g. 65536 cycles for a 256 × 256 image), the performance improvement will be negligible. A typical image processing algorithm (e.g. an edge detector) may have a latency of up to 20 cycles. The result reuse technique implemented in a scalar pipeline may decrease the latency down to 5 cycles.


In the best-case scenario, in which the reuse rate is 100%, the ratio between the numbers of clock cycles for the two situations (which we call the speedup) will be (65536 + 20)/(65536 + 5) ≈ 1.00.

As was mentioned before, the maximum throughput of a scalar pipeline is limited to one entry per cycle [20]. To overcome this barrier in a hardware implementation, we propose to use a 2-wide superscalar pipeline instead. In a 2-wide superscalar pipeline, which was implemented in processors in the early 90s (e.g. in the Pentium microprocessor by Intel in 1993 [15]), two instructions can be issued to two parallel execution units simultaneously. The 2-wide superscalar Pentium microprocessor can ideally execute two instructions per machine cycle [20].


Figure 2.15: Scalar and superscalar pipelines

Generally, a 2-wide superscalar pipeline is obtained by replicating the hardware of a scalar pipeline (Figure 2.15). For classic 2-wide superscalar pipelines with no hazards (e.g. classic implementations of image processing algorithms), doubling the hardware area doubles the throughput of the pipeline. In contrast, our proposed 2-wide superscalar pipeline increases throughput up to 2, but with extra hardware that is significantly less than the hardware required by a classic 2-wide pipeline. The extra hardware implements the reuse mechanism, which includes a reuse buffer and a reorder buffer. We have implemented a 2-wide pipeline (Figure 2.16) that accepts two pixels as inputs at each clock cycle instead of one. As can be seen from Figure 2.16, instead of replicating the calculation stage, we have only added a reuse buffer and a reorder buffer. The reuse buffer stores the tables and their results in order to reuse the results, and the reorder buffer ensures that the outputs exit the pipeline in the correct order.

Figure 2.16: A superscalar pipeline for result reuse. Two incoming pixels feed two table-creation units; each table is checked against the reuse buffer (RB), which is updated when at least one table hits; tables that miss go through the calculation stage; a reorder buffer collects the two outputs.

The reuse buffer stores the tables and their results so that the results can be reused, and the reorder buffer ensures that outputs exit the pipeline in the correct order. As soon as the first two rows of pixels plus the first two pixels of the third row have entered the pipeline, two tables are ready to be processed at each clock cycle. Each table is checked against the reuse buffer, so three situations can occur in the reuse buffer check stage (a behavioral sketch of this decision logic is given after the list):

• Both tables hit the reuse buffer: the pipeline produces two output pixels using the reuse buffer, and the two succeeding pixels can enter the pipeline to generate two new tables. There is no need to stall the pipeline.

• One of the tables hits the reuse buffer: the other table is sent through the operations stage while the first one simply looks up its result in the reuse buffer. As a result, the pipeline does not need to stall.

• Neither table hits the reuse buffer: the pipeline must stall, letting only one new pixel enter. Both tables are sent through the operations stage serially. In other words, the input to the pipeline stalls for one clock cycle in order to let the two in-flight tables be processed in sequence.
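The decision logic described above can be summarized in a few lines of high-level Python. This is only a behavioral sketch with names of our choosing (reuse_check, calc), not part of the VHDL design, and it models the reuse buffer as an unbounded dictionary rather than a finite CAM.

    def reuse_check(table_a, table_b, reuse_buffer, calc):
        """Model of the reuse-buffer check stage for one clock cycle.

        table_a, table_b : tuples of 9 quantized gray levels (the two in-flight tables)
        reuse_buffer     : dict mapping a table to its stored filter result
        calc             : the filtering operation (e.g. the Sobel operator)
        Returns (result_a, result_b, new_pixels_accepted_next_cycle).
        """
        hit_a, hit_b = table_a in reuse_buffer, table_b in reuse_buffer
        if hit_a and hit_b:
            return reuse_buffer[table_a], reuse_buffer[table_b], 2   # no stall
        if hit_a or hit_b:
            miss = table_b if hit_a else table_a
            reuse_buffer[miss] = calc(miss)                          # operations stage
            return reuse_buffer[table_a], reuse_buffer[table_b], 2   # still no stall
        reuse_buffer[table_a] = calc(table_a)                        # both miss: process
        reuse_buffer[table_b] = calc(table_b)                        # serially and stall
        return reuse_buffer[table_a], reuse_buffer[table_b], 1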

From the above cases it follows that, in the ideal case in which at least one of the two tables always hits the reuse buffer, we can process an n × n pixel image in (n × n)/2 clock cycles instead of n × n clock cycles. In other words, using a 2-wide pipeline, the throughput of the system can be doubled for a given image in comparison to a scalar pipeline.

Implementing result reuse with a 2-wide superscalar pipeline, we can double the throughput with extra hardware consisting of a reuse buffer, control circuitry and reordering circuitry, which is significantly less than the area needed to double the operations stage. The control circuitry is responsible for searching and updating the reuse buffer, and the reordering circuitry guarantees that the output data always exits the pipeline in the correct order. When the pipeline has to stall, only one new pixel can enter the pipeline. This requires the pipeline to interact with the environment to decide whether two pixels can enter at each clock cycle. An interesting alternative is a pipeline that dynamically changes the number of gray levels in order to keep up with the stream of incoming pixels and avoid stalling; we will consider this solution as future work. Figure 2.16 shows the 2-wide pipeline used in our behavioral implementation. The extra hardware required by the proposed method should not exceed the hardware needed to double the operations stage, in which case it would be easier to simply duplicate the operations circuit. However, as we will discuss in the next subsection, the result reuse mechanism is generally much cheaper in terms of hardware area than the operations required by a typical spatial domain algorithm.

2.4.3 Hardware Area Estimation for a Generic Model of Spatial Domain Algorithms

Many image processing algorithms are applied to images in the spatial domain. Spatial domain algorithms cover a broad scope and include edge detection algorithms, smoothing spatial filters, order-statistics filters, sharpening spatial filters, morphological algorithms and image segmentation algorithms. Moreover, filters designed in the frequency domain are usually converted back to the spatial domain before being implemented in hardware. This is mainly due to the complexity of frequency domain algorithms in comparison to the
corresponding algorithms in the spatial domain. Although different spatial algorithms (in other words, filters) perform different calculations on each pixel neighborhood, they all share a major characteristic: a mask is moved over the image (i.e. convolution). At each pixel of the image, the response of the filter, which is assumed to be linear, is calculated as the sum of the products of the mask coefficients and the corresponding pixels in the image covered by the filter mask. Considering Table 2.1, which shows a 3 × 3 mask, the generic equation of the filter response at each point (x, y) in the image can be written as:

g(x, y) = Σ_{i=−1..1} Σ_{j=−1..1} m(i, j) × image(x + i, y + j)        (2.1)

where image(x, y) is the gray level of the pixel at location (x, y) and g(x, y) is the filter response at (x, y).

Table 2.1: A 3 × 3 mask
m(-1,-1)   m(-1,0)   m(-1,1)
m(0,-1)    m(0,0)    m(0,1)
m(1,-1)    m(1,0)    m(1,1)
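For reference, equation 2.1 corresponds directly to the following straightforward (and deliberately unoptimized) computation. The function and variable names are ours; this is only a behavioral illustration of the filter response, not part of the proposed hardware.

    def filter_response(image, mask, x, y):
        """Filter response g(x, y) of a 3x3 mask, following equation 2.1.

        image : 2-D list (or array) of gray levels
        mask  : dict mapping (i, j), with i and j in {-1, 0, 1}, to the mask coefficient
        """
        g = 0
        for i in (-1, 0, 1):
            for j in (-1, 0, 1):
                g += mask[(i, j)] * image[x + i][y + j]
        return g

For example, a 3 × 3 averaging (smoothing) filter would use mask = {(i, j): 1/9 for i in (-1, 0, 1) for j in (-1, 0, 1)}.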

Using equation 2.1, we can estimate the area required to implement the filtering operations in hardware. This estimate will help us calculate the ratio between the hardware area required to implement a typical spatial domain algorithm and the area needed to implement the result reuse technique. If all the mask coefficients are non-zero, there will be nine multiplications and eight additions for each pixel in the image. Furthermore, it is possible to have several filter masks per pixel. For instance, the Sobel edge detector uses 8 different masks to detect edges in eight different directions (north, south, west, east, north east, south west, north west and south east). Therefore, the extreme case may consist of up to 72 multiplications and 64 additions, as well as 9 comparisons (8 comparisons to find the maximum magnitude and 1 to compare the maximum magnitude with the threshold).
Although in cases where the multiplicand is 2 we could use a shift instead of a multiply, for generality we assume that the multiplications are implemented as multipliers. Assuming that all the calculations are 8 bits wide, the following table shows the FPGA cells required for implementing an adder, a multiplier and a comparator in Altera FPGAs.

Table 2.2: FPGA cells required for different operations
Multiplier    159
Adder          25
Comparator     25

An Altera FPGA cell consists of a flip-flop and a 4-input combinational logic block. We obtained the numbers in Table 2.2 by implementing the operators on the EP20K200EFC484-2X, a member of the Altera APEX20KE FPGA family [6]. To estimate the number of FPGA cells required by a typical spatial domain algorithm, we need to take into account the effect of applying digital design optimization techniques, which can reduce the hardware area needed to implement a particular algorithm. To that end, we implemented a derivative circuit in two versions, non-optimized and optimized with digital design optimization techniques. The optimized version used an area equal to 80% of the original version, while being 70% faster. We assume that the reuse buffer will not lie on the critical path of the system, which, as discussed in the next subsection, is a valid assumption. Therefore, we only need to consider the effect of optimizing the filtering operations from the area perspective. Considering the area required for a multiplier, adder and comparator, and taking into account the effect of optimizing the operators, the following equation estimates the area needed to implement the filtering operations of a 3 × 3 mask:

area = 0.8 × (mult_num × 159 + add_num × 25 + comp_num × 25)        (2.2)

where mult_num, add_num and comp_num are the numbers of multipliers, adders and comparators respectively. Using this equation, we can estimate the area for a given number of adders and
multipliers for a filtering mask. For example, the Kirsch edge detection operator needs 8 different masks, and each mask requires 8 multiplications and 7 additions. Furthermore, 8 comparisons are necessary to calculate the maximum magnitude. This gives a total of 9420 FPGA cells.
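The Kirsch figure can be reproduced with a short sketch of equation 2.2; the function name below is ours and the per-operator costs come from Table 2.2.

    def filter_area(mult_num, add_num, comp_num):
        # Equation 2.2: area estimate in FPGA cells, with the 0.8 factor accounting
        # for the digital design optimizations measured on the derivative circuit.
        return 0.8 * (mult_num * 159 + add_num * 25 + comp_num * 25)

    # Kirsch operator: 8 masks x 8 multiplications, 8 masks x 7 additions, 8 comparisons.
    print(filter_area(8 * 8, 8 * 7, 8))   # ~9420 FPGA cells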

2.4.4 Hardware Area Estimation for the Result Reuse Technique

The bottleneck in our proposed design method is the reuse buffer, since the hardware needed to implement the control and reordering circuits is expected to be reasonably small (about 50 FPGA cells). Both the area required to implement the reuse buffer and the latency needed to search through the memory are serious challenges. However, with modern full-custom VLSI and FPGA technology, the result reuse technique can be implemented efficiently. Throughout our research, we will use FPGAs as the platform for implementing our designs. Unlike RAM devices, which use an address to identify an item stored in the memory, Altera FPGAs (e.g. the APEX20KE family [4]) and Xilinx FPGAs (e.g. the Virtex family [19]) embed a technique that uses the content of the stored item to identify it. This technique, which is called content-addressable memory (CAM), simultaneously compares the requested data against a list of entries [4], rather than sequentially searching through addresses (RAM). As a result, for matching a value against the entries already stored in a memory, CAM is 98% faster than RAM, with one clock cycle of latency and a very short delay (4 ns) for a read operation [4] and two clock cycles of latency for a write operation [5]. Different Altera APEX family devices have various CAM sizes, ranging from 16,384 bits up to 233,472 bits [4]. Furthermore, the memory blocks of these FPGAs (called Embedded System Blocks (ESBs)) are separate from the rest of the FPGA cells. Therefore, by using the memory blocks we can make the system work faster while saving the many FPGA cells that would otherwise have been used to replicate the filtering circuit. The saved FPGA cells can be used to implement another circuit that is as complicated as the filtering circuit. Nevertheless, to estimate the area used by the CAM memory, we calculated the average ratio of CAM bits to FPGA cells for the Altera APEX20KE family [4]. We found that for each CAM bit there are about 0.174 FPGA
cells. Moreover, decreasing the number of gray levels, which increases the reuse rate, also decreases the number of CAM memory bits required to store pixel data, since fewer gray levels require fewer bits per stored value. This leads us to the following equations for the area used by the reuse buffer, control circuitry and reorder buffer. In these equations, CAM_units is the number of tables to be stored, each consisting of 9 pixels plus its result (10 CAM values per unit), and 50 is the number of FPGA cells estimated to be required to implement the reorder buffer and the reuse buffer control circuitry. Assuming that the reorder buffer has to store 10 outputs (the edge detection output is 1 bit), which is a valid assumption given the latency difference between the two pipeline paths in Figure 2.16, the head and tail pointers of the reorder buffer each need 4 bits to indicate which element to access. Thus, the head and tail each need an incrementer and a comparator, which require about 6 FPGA cells, meaning that 16 FPGA cells are needed for the reorder buffer. Similarly, we estimate that the control circuitry for the reuse buffer will require less than 34 FPGA cells.

area = 0.174 × CAM_bits + 50                                         (2.3)
CAM_bits = CAM_units × 10 × log2(num_gray_levels)                    (2.4)
area = 0.174 × CAM_units × 10 × log2(num_gray_levels) + 50           (2.5)

Using a 32 unit reuse buffer with 8 gray levels will consume an area of:

area = 0.174 × 32 × 10 × log2(8) + 50 ≈ 220                          (2.7)

FPGA cells, which is negligible in comparison to the area required by many filtering operations (e.g. the 9420 FPGA cells needed for the Kirsch operator). As will be discussed later, a 32 unit reuse buffer with the number of gray levels reduced to 8 leads to a high result reuse rate and hence a high performance gain.
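A similarly rough sketch of equations 2.3-2.5 reproduces this estimate and shows how small it is relative to a typical filtering datapath; the function name is ours.

    import math

    def reuse_buffer_area(cam_units, num_gray_levels, control_cells=50):
        # Equations 2.3-2.5: each unit stores 10 values (9 pixels plus one result),
        # each value needs log2(num_gray_levels) bits, one CAM bit costs about
        # 0.174 FPGA cells, and 50 cells cover the reorder buffer and control.
        cam_bits = cam_units * 10 * math.log2(num_gray_levels)
        return 0.174 * cam_bits + control_cells

    print(reuse_buffer_area(32, 8))           # ~217 cells (quoted above as ~220)
    print(reuse_buffer_area(32, 8) / 9420)    # ~2% of the Kirsch filtering area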

The fact that two clock cycles are needed to write into the CAM memory can be a bottleneck for the implementation of the reuse buffer, since it affects the total throughput. To solve this problem, a separate clock can be used for the CAM with a frequency twice that of the clock used for the rest of the circuit. It then takes two fast clock cycles to write to the CAM, which is equivalent to one cycle of the original clock. Given the short read delay of the CAM memory (4 ns), this solution is feasible; moreover, using multiple clocks in FPGAs is realistic. The performance of the system can be optimized even further with a more aggressive result reuse configuration: if we use a 3-wide superscalar pipeline instead of a 2-wide one, we can keep the same reuse buffer size and perform the calculations up to three times faster than a scalar pipeline. This only requires more read/write ports on the reuse buffer and slightly more complicated reordering circuitry, which is negligible in comparison to the obtained speedup. It is worth mentioning that the speedup gained by the result reuse technique is not tied to our case study, the Sobel edge detector circuit; applying the technique to much more complicated algorithms should achieve the same performance gain as for Sobel. In the next two sections we implement our proposed technique (result reuse) at the hardware behavioral level using VHDL. First, we apply the technique to the Sobel circuit using a large reuse buffer (32 units); afterward, we apply it to the same circuit with a small reuse buffer (4 units). For both simulations, we calculate speedup, precision and area. The area ratio is calculated for two cases: the Sobel circuit, which consumes about 700 FPGA cells, and a hypothetical spatial domain algorithm that consumes 3000 FPGA cells, which we call an average circuit. This assumption is reasonable since we estimated that spatial domain algorithms may need more than 9000 FPGA cells for hardware implementation (Section 2.4.3).

2.4.5 Behavioral Level Implementation of Sobel Edge Detector Applying the Proposed Result Reuse Technique

We implemented the Sobel edge detector, applying our proposed technique, at the hardware behavioral level using VHDL. The implementation follows the block diagram of Figure 2.16. The input
to the circuit is a 256 × 256 gray level image. At each clock cycle, two pixels are available at the inputs. The circuit reads the inputs and generates two 3 × 3 tables. If either of the tables hits the reuse buffer, the two tables are processed simultaneously, the buffer is updated, and the pipeline continues to pick up two new pixels at the next clock cycle. If neither table hits the reuse buffer, the pipeline cannot let two new pixels enter at the next clock cycle. It therefore stalls, handshakes with the environment and requests one new pixel instead of two. At the same time, one table is sent through the Sobel operator while the other waits until the first one has been processed. At the next clock cycle, one new table is created (based on the one new pixel) and, along with the table that was waiting from the previous cycle, the process is repeated to see whether either of the tables hits the reuse buffer, and so on. A high-level sketch of the cycle counting that underlies the results below is given after this paragraph.
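The behavioral VHDL model is not reproduced here, but the way the speedup figures in Tables 2.3-2.8 are obtained can be summarized by the following sketch (in Python rather than VHDL, with names of our choosing). It processes the stream of 3 × 3 tables two at a time, charges one extra cycle whenever both tables miss the finite reuse buffer, and returns the speedup over the 65536-cycle scalar baseline. The FIFO replacement policy and the fixed pairing of adjacent tables are simplifying assumptions of this sketch.

    from collections import OrderedDict

    def simulate_result_reuse(tables, buffer_units):
        """tables: per-pixel 3x3 tables (tuples of quantized gray levels) in raster order.
        buffer_units: reuse buffer depth (e.g. 32 or 4). Returns the achieved speedup."""
        reuse_buffer = OrderedDict()                  # finite buffer, oldest entry evicted
        cycles = 0
        for k in range(0, len(tables) - 1, 2):        # two tables per cycle in the ideal case
            pair = (tables[k], tables[k + 1])
            misses = [t for t in pair if t not in reuse_buffer]
            cycles += 1 if len(misses) < 2 else 2     # both miss -> one stall cycle
            for t in misses:                          # operations stage and buffer update
                reuse_buffer[t] = True                # (the stored result itself is not modeled)
                if len(reuse_buffer) > buffer_units:
                    reuse_buffer.popitem(last=False)
        return len(tables) / cycles                   # a scalar pipeline needs one cycle per table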

Table 2.3: Speedup, precision and area for class Low with a 32 unit reuse buffer
gray levels   precision (%)   speedup   area for Sobel   area for an average circuit
8             91              1.63      1.31             1.07
4             90              1.80      1.23             1.05
2             88              1.93      1.15             1.02

Figure 2.17: Result reuse and precision rate for a 32 unit reuse buffer for the class Low images. (a) reference; (b) speedup 1.63, area 1.31, precision 91%; (c) speedup 1.80, area 1.23, precision 90%; (d) speedup 1.93, area 1.15, precision 88%.

First, we ran the simulation with a 32 unit reuse buffer for three different numbers of gray levels (2, 4 and 8). Assuming a latency of one clock cycle, the regular Sobel edge detector requires at least 256 × 256 = 65536 clock cycles to detect the edges of the input image. We calculated
the average number of clock cycles required to process the images of each class. Speedup is the ratio between the number of clock cycles required by a regular Sobel circuit (i.e. 65536) and by the result reuse circuit. In addition, we calculated the average precision of the output for each class. For the area calculation, we used two references: the Sobel circuit with 700 FPGA cells and a hypothetical circuit with 3000 FPGA cells, which we call an average circuit. The area figure reports the total hardware required by the result reuse design as a ratio of the original circuit, to be compared against the factor of 2 needed to replicate the datapath.

Table 2.4: Speedup, precision and area for class Medium with a 32 unit reuse buffer
gray levels   precision (%)   speedup   area for Sobel   area for an average circuit
8             86              1.32      1.31             1.07
4             79              1.53      1.23             1.05
2             78              1.85      1.15             1.02

Figure 2.18: Result reuse and precision rate for a 32 unit reuse buffer for the class Medium images. (a) reference; (b) speedup 1.32, area 1.31, precision 86%; (c) speedup 1.53, area 1.23, precision 79%; (d) speedup 1.85, area 1.15, precision 78%.

As Table 2.3 shows, for the class Low images, even with a high precision (91%), obtained with 8 gray levels, the speed of the calculations increases by 63% with only 31% extra area cost for Sobel and 7% for an average circuit (Figure 2.17). Figure 2.17.d shows that for the many applications in which we are only interested in the main object of the image, we can reduce the number of gray levels to a small value (e.g. 2) to gain a speedup of 1.93,
while preserving the main functionality of the filter (88% precision), with an extra area cost of 15% for Sobel and 2% for an average circuit.

Table 2.5: Speedup, precision and area for class High with a 32 unit reuse buffer
gray levels   precision (%)   speedup   area for Sobel   area for an average circuit
8             87              1.17      1.31             1.07
4             75              1.28      1.23             1.05
2             71              1.69      1.15             1.02

Figure 2.19: Result reuse and precision rate for a 32 unit reuse buffer for the class High images. (a) reference; (b) speedup 1.17, area 1.31, precision 87%; (c) speedup 1.28, area 1.23, precision 75%; (d) speedup 1.69, area 1.15, precision 71%.

The same simulation was performed on the images of class Medium. As Table 2.4 shows, a speedup of 1.85 is obtained with 78% precision, at an extra cost of 15% and 2% for Sobel and an average circuit respectively. From Figure 2.18.d it is observed again that even a precision of 78% preserves the main objects of the image while increasing the speedup to 1.85. Table 2.5 shows the results for the images of class High for different numbers of gray levels with a 32 unit reuse buffer. As expected, the images of this class are very detailed and hence the result reuse rate is not high for a large number of gray levels (e.g. 8). However, reducing the number of gray levels to 2 increases the speedup to 1.69 with only a 15% increase in area for Sobel and 2% for an average circuit. As Figure 2.19.d shows, although a small number of gray levels (e.g. 2) reduces the precision to 71%, it gives the best speedup result (1.69).
Therefore, based on the precision required by a specific application, the corresponding number of gray levels can be selected to speed up the real-time filtering process.

2.4.6 Behavioral Simulation with a Small Reuse Buffer

Although even the smallest FPGA in the Altera APEX family has 16,384 bits of CAM memory, which, used as a reuse buffer, allows up to 1,638 3 × 3 tables to be stored, we anticipate that for images with a high rate of pixel redundancy we can achieve a high speedup even with a very small memory. To investigate this, we performed the behavioral simulation for images of the three classes with a 4 unit reuse buffer.

Table 2.6: Speedup, precision and area for class Low with a 4 unit reuse buffer
gray levels   precision (%)   speedup   area for Sobel   area for an average circuit
8             91              1.54      1.10             1.024
4             90              1.70      1.09             1.021
2             88              1.82      1.08             1.019

Figure 2.20: Result reuse and precision rate for a 4 unit reuse buffer for the class Low images. (a) reference; (b) speedup 1.54, area 1.10, precision 91%; (c) speedup 1.70, area 1.09, precision 90%; (d) speedup 1.82, area 1.08, precision 88%.

As Table 2.6 and Figure 2.20.d show, with the class Low images we obtained a speedup of 1.82 with 88% precision, with only 8% extra area for Sobel and 1.9% extra area for an average circuit. This means that at a very small cost in area, and with a precision acceptable for many applications, we can process all the filtering operations 1.82 times faster. Moreover,
a precision of 91% was obtained with a speedup of 1.54 and, yet again, a negligible cost in hardware (10% for Sobel and 2.4% for an average circuit).

Table 2.7: Speedup, precision and area for class Medium with a 4 unit reuse buffer
gray levels   precision (%)   speedup   area for Sobel   area for an average circuit
8             86              1.25      1.10             1.024
4             79              1.40      1.09             1.021
2             78              1.63      1.08             1.019

Figure 2.21: Result reuse and precision rate for a 4 unit reuse buffer for the class Medium images. (a) reference; (b) speedup 1.25, area 1.10, precision 86%; (c) speedup 1.40, area 1.09, precision 79%; (d) speedup 1.63, area 1.08, precision 78%.

Performing the behavioral simulation with a 4 unit reuse buffer on the images of class Medium, we found that we can still apply a small memory to this class of images and obtain a speedup of up to 1.63 with 78% precision, at a very small cost in hardware (Table 2.7 and Figure 2.21.d).

Table 2.8: Speedup, precision and area for class High with a 4 unit reuse buffer
gray levels   precision (%)   speedup   area for Sobel   area for an average circuit
8             85              1.12      1.10             1.024
4             76              1.20      1.09             1.021
2             71              1.48      1.08             1.019

For the images of class High, as observed from Table 2.8 and Figure 2.22.d, with a 4 unit reuse buffer we obtained a speedup of 1.48 at 71% precision, with extra hardware of 8% and 1.9% for Sobel and an average circuit respectively.

Figure 2.22: Result reuse and precision rate for a 4 unit reuse buffer for the class High images. (a) reference; (b) speedup 1.12, area 1.10, precision 85%; (c) speedup 1.20, area 1.09, precision 76%; (d) speedup 1.48, area 1.08, precision 71%.

From the previous simulations, the parameters to be adjusted for the result reuse technique are precision, reuse buffer size and speedup. Speedup depends on the result reuse rate. The reuse rate depends on the number of reuse buffer units (the reuse buffer depth) and on the result precision. Precision depends on the number of gray levels. Area depends on the reuse buffer unit width (determined by the precision, i.e. the number of gray levels) and on the number of reuse buffer units (the reuse buffer depth). This leads to an interesting point: for a fixed reuse buffer depth, a higher speedup requires less area, since it requires less precision (fewer gray levels) and hence a smaller reuse buffer unit width (Figure 2.23).

Figure 2.23: Design space for result reuse, showing the direct and inverse relations among the independent parameters (number of gray levels and RB depth), RB width, area, result reuse rate, speedup and precision.

Every specific application can have a certain optimal point at which it meets its precision requirement, uses an appropriate reuse buffer size and, at the same time, benefits from the speedup gained by result reuse. This gives the designer the flexibility to tune the system for different
applications and obtain the desired performance. The tuning can be done either statically, during the design process, or dynamically, by the user.

2.5 Result Prediction: Preliminary Test

Many image processing algorithms are recursive, meaning that they require their previous results in order to generate new results. When these algorithms are implemented in hardware, their recursive nature leads to feedback paths. Pipelining is an optimization technique that breaks a time-consuming task into smaller stages that can execute in parallel. Feedback paths in hardware make it difficult to optimize the performance of a circuit with pipelining.

Figure 2.24: Feedback path in hardware. The original circuit consists of stage A followed by a large calculations stage with a feedback path; in the pipelined version, the calculations stage is split into stages B, C and D, with the feedback from stage D returning to stage A.

Figure 2.24 shows an attempt to pipeline a circuit that has a large "calculations" stage and a feedback path. The "calculations" stage has been pipelined into three smaller stages: B, C and D. Assume that two consecutive instructions (α and β) with a true data dependence between them enter the pipeline. When instruction α enters stage B, instruction β enters stage A. Instruction α continues to flow through the pipeline by entering stage C, and we would expect instruction β to follow it into stage B. However, instruction β has to wait in stage A until instruction α reaches stage D and sends its result from that stage back to stage A. As soon as instruction β receives the result of instruction α from stage D, it can continue flowing through the pipeline by entering stage
B. In other words, although pipelining shortens the critical path caused by the large "calculations" stage, it increases the latency of the pipeline by the number of extra stages. In the example of Figure 2.24, the number of stages increases from 2 to 4 and hence the latency increases by 2 as well. For circuits with a feedback path, pipelining therefore does not improve performance, since the shortening of the critical path is compensated by the extra latency imposed by the feedback path. To overcome this limitation, we propose to develop a technique (result prediction) based on value prediction, which predicts the result of instruction α in stage D as soon as instruction β enters stage A. This lets instruction β continue to flow through the pipeline without stalling in stage A waiting for the result of instruction α from stage D. The prediction must then be verified against the actual result of instruction α in stage D.

Figure 2.25: Result prediction. The output of stage D is predicted when an instruction enters stage A, and the prediction is verified against the actual stage D output later in the pipeline.

Figure 2.25 shows how our proposed technique can be implemented in hardware. Because of the locality of image data, we anticipated that the prediction could be made with high accuracy. We applied result prediction to the Sobel edge detector case study to investigate the feasibility of the idea for image processing. The Sobel edge detector is a non-recursive algorithm; however, for the purpose of this preliminary test, we assumed that there is a feedback path between the last stage and the first stage of the Sobel circuit. As the prediction mechanism, we used a simple 2-bit saturation machine, which predicts the result of an operation based on the previous results.

Figure 2.26: 2-bit saturation machine. States S0, S1 and S2 predict 0 and state S3 predicts 1; transitions between states are driven by the actual result (0 or 1).

As Figure 2.26 depicts, starting from the initial state S0, as long as two consecutive actual results have not been 1 (states S0, S1 and S2), the saturation machine predicts the result as 0; otherwise, the result is predicted as 1 (state S3). For our high-level simulation (using Matlab), we used 13 different 256 × 256 images from the three classes of detailedness. Interestingly, the average accuracy of result prediction was 78%. In result prediction, the predicted result is verified against the actual result; if a misprediction has occurred, the speculatively executed operations are discarded and re-executed with the actual result. Hence, in result prediction, the result precision is always 100%. A small sketch of the predictor is given below.
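One possible reading of Figure 2.26 is a saturating counter that predicts 1 only in its topmost state. The small sketch below (our own illustration, not the Matlab code used for the experiment) measures prediction accuracy over a stream of 1-bit actual results.

    def predict_accuracy(actual_results):
        """2-bit saturation predictor: states 0..3 correspond to S0..S3; only S3 predicts 1.
        actual_results is a sequence of 0/1 values; returns the fraction predicted correctly."""
        state, correct = 0, 0
        for actual in actual_results:
            prediction = 1 if state == 3 else 0
            correct += (prediction == actual)
            # Saturating update: move toward S3 on a 1, toward S0 on a 0.
            state = min(state + 1, 3) if actual else max(state - 1, 0)
        return correct / len(actual_results)

    # Long runs of identical results, typical of edge maps, are predicted well.
    print(predict_accuracy([0] * 50 + [1] * 10 + [0] * 40))   # 0.96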

2.6 Future Work

As future work, we will modify the detailedness algorithm so that it covers a broader range of images in terms of degree of complexity. We will work on a mathematical description of precision in order to arrive at a formal definition of result precision; it is necessary to distinguish between errors caused by hardware implementation bugs and the lower precision due to altering the number of gray levels. The 2-wide superscalar pipeline design of result reuse will be implemented in hardware (at the register-transfer level). Moreover, a dynamic design will be considered, in which the number of gray levels is changed dynamically in order to obtain a desired (constant) throughput
without the need to stall the pipeline. The result reuse technique will also be designed as a 3-wide (or n-wide) superscalar pipeline in order to achieve higher performance. For result prediction, different algorithms will be explored to find the best prediction accuracy, and the technique will be implemented in hardware. We will also investigate the feasibility of combining the result reuse and result prediction techniques.

In this chapter, we presented two new techniques (result reuse and result prediction) that are inspired by advanced processor data flow techniques (instruction reuse and value prediction). These techniques take advantage of the locality of image data to optimize the performance of image processing circuitry. Various simulations performed at different levels were presented and the results were discussed. We plan to implement the techniques in hardware and to investigate other design possibilities in order to further optimize the performance of image processing circuitry. In the next chapter, formal verification of image processing circuitry will be discussed.

Chapter 3

Formal Verification of Pipelined Circuitry

Pipelining is a performance optimization method for digital circuit design. It splits a time-consuming task into smaller stages, which can run faster and overlap with each other. Although pipelining is an effective optimization technique, it makes the verification task more difficult. This is mainly due to the parallel nature of pipelining, which leads to different pipeline hazards (structural, data and control). Pipeline hazards may cause the pipeline to stall. Pipeline verification techniques must be able to check the correctness of the pipeline while considering all the possible hazards. In this chapter, we first present an introduction to formal verification and then give background on formal verification techniques. Afterward, related work on the formal verification of pipelined designs is briefly reviewed. Finally, we present a technique for the formal verification of pipelined circuits, which has been developed in a group effort. We applied the technique to a classic implementation of the Sobel edge detector at the register-transfer level, and our technique found two bugs.

3.1 Introduction to Formal Verification

The increasing complexity of modern life requires complicated computers, which are visible in every aspect of our daily lives. Our transportation systems (e.g. aircraft, trains, subways,
etc.), on which we rely every day, depend heavily on computers. The medical equipment on which our health depends (e.g. radiation therapy machines, heart pacemakers, medical imaging) is controlled by computers. Nuclear reactors, which provide us with a tremendous amount of energy, are run by computers. These systems, and many others widely used around the globe, are considered safety-critical systems and therefore their functional correctness is crucial: a failure in any of these systems could cause the loss of human lives. For example, an error in the timing of data entry in a radiation therapy machine (the Therac-25) caused six people to be exposed to radiation overdoses during 1986-87 [30]. Even a failure in a system that is not safety-critical can cause huge financial losses. The well-known example is the Intel FDIV (floating point division) bug of 1994: approximately 2 million Pentium chips were shipped containing an error in the floating point unit, and Intel's announced no-questions-asked replacement cost about US$475 million [43]. A bug in a computer system can be due to a failure in either hardware or software; to make a system work correctly, both must be bug free. Throughout the rest of this report, we focus on the formal verification of hardware designs. The traditional way to verify the correctness of hardware (simulation) is to generate as many input sequences as possible and produce the corresponding outputs; comparing each generated output to the expected output value indicates the presence or absence of an error in the hardware design. In reality, it is almost impossible to cover all possible combinations of inputs. For example, to test a 256 bit RAM, 2^256 combinations of initial states and inputs would be required. To complete the simulation of the RAM in one year, each simulation would have to be performed in 2.72 × 10^-70 seconds. While the fastest computers can hardly beat 10^10 simulations/second, the computer needed to perform the complete simulation of a 256 bit RAM in one year would have to be 3.67 × 10^59 times faster than the fastest computers that exist today.
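The arithmetic behind this example can be checked in a few lines (a rough sanity check, assuming one year is about 3.15 × 10^7 seconds):

    SECONDS_PER_YEAR = 365 * 24 * 3600             # ~3.15e7
    cases = 2 ** 256                               # initial states and inputs of a 256 bit RAM
    time_per_case = SECONDS_PER_YEAR / cases       # ~2.7e-70 seconds per simulation
    required_rate = cases / SECONDS_PER_YEAR       # ~3.7e69 simulations per second
    print(time_per_case, required_rate / 1e10)     # ~3.7e59 times faster than 1e10 sim/s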

3.1.1 Formal Verification of Hardware Systems

Experience shows that traditional simulation is not adequate to make a system reliable, since it has a limited ability to prove that a system is bug free. For example, in the pre-silicon validation of the Pentium 4 processor, the simulation of over 200 billion cycles corresponds to only 2 CPU minutes [38]. The increasing inability of traditional simulation to deal with complex designs, which contain up to several hundred million transistors, has drawn attention to formal verification, and consequently it has become one of the most promising and challenging techniques in hardware and system design. Formal verification, an algorithmic approach that exhaustively proves the functional correctness of hardware, can be seen as a complement to simulation that helps hardware engineers catch the bugs that occur only in very narrow cases and cannot be found by simulators. Hardware design usually starts with a high-level description, provided in different formats (e.g. block diagrams, flowcharts, etc.), that indicates the final expected functionality of the system. Traditional simulation, as mentioned before, tries to validate the design by comparing the simulation results with the expected results of the high-level description. Formal verification, on the other hand, provides a mathematical method for both defining the high-level description (the specification) and verifying the design (the implementation) against the specification in a formal way. A formal specification is a concise description of a system's behavior and properties in a mathematically based language. The formal description, which abstracts away the unnecessary details of the implementation, must formally declare what the system is supposed to do and should be insensitive to future modifications of the system [44]. The formal implementation of a system is usually given in the form of a netlist or a hardware description language (VHDL or Verilog), which can be read by a machine for verification purposes [44]. The formal implementation is an abstract model of the physical circuit, containing the necessary details of the physical implementation depending on the level of abstraction; it can be described at the behavioral level, register-transfer level, gate level or circuit layout. The techniques currently used to verify a design implementation against its specification include:

• Model Checking: verifying an implementation, described as a state machine, against a specification expressed in a suitable language. The verification is usually done by a complete search over the state space of the implementation.

• Equivalence Checking: verifying that the functionality of two different circuits, possibly at two different levels of abstraction, is the same.

• Theorem Proving: the implementation and specification are described in a formal logic and, using mathematical reasoning, it is proved (or disproved) that the implementation satisfies the specification.

In Section 3.2, we discuss each of these methods briefly.

3.1.2 Successes and Challenges of Formal Verification

Although formal verification CAD tools have been introduced to the industrial world only recently, many major companies have already started using formal methods to verify the correctness of their hardware designs. After the FDIV bug of 1994, which cost Intel millions of dollars, the company decided to use formal verification methods for the pre-silicon verification of the Pentium 4 design. It was reported that, using formal methods, the number of bugs detected in the structural register-transfer-level code of the Pentium 4 increased by 350% in comparison to the Pentium Pro validation, where only traditional simulation had been applied [31]. Among the bugs detected by formal verification were two that, had they gone undetected, could have caused Intel to face the same problem as the FDIV bug. In addition to Intel, many other companies, such as Motorola [18] and IBM [22], have included formal verification methods in their design flows. However, formal verification still faces a series of challenges that require time and effort from research groups. The first hurdle is the increasing complexity of circuits. According to Moore's law, the number of transistors embedded in digital circuits doubles every 18-24 months. More transistors means that more complicated systems can be embedded into one circuit, which in turn requires formal verification tools to have much
more capacity to handle the verification task. The second limitation is that formal verification methods catch bugs at the pre-silicon level; bugs due to fabrication faults are not discovered. Third, verifying systems that consist of different subsystems is still difficult: even if all the parts are verified separately and their functionality is correct, putting them together may introduce errors that could not have been detected by verifying each subsystem on its own. The fourth challenge is that formal verification checks the correctness of an implementation against a specification, which itself can have bugs [44]; this requires the specification to be validated against the real requirements before being used by formal verification tools. Finally, the formal verification tools themselves can contain errors.

3.2 Background on Formal Verification

In this section, we briefly review the basic mathematical background used in hardware formal verification. To verify the correctness of a digital system, there are three common approaches: model checking, equivalence checking and theorem proving. In all these methods, a model of the implementation needs to be created. The model should be capable of representing the design and its implementation at the same time, which means the created model may have different levels of detail, and it should contain all the properties that are essential for establishing the correctness of the design. From the correctness perspective, a model with too much detail contains implementation details that do not contribute to establishing the correctness of the design and only make the verification task more complicated [14]. Reducing the complexity of the verification task is possible by abstracting away the unnecessary details of an implementation; however, it is very important to retain the properties that are essential for establishing the correctness of the model.

3.2.1 Model Checking

Model checking is a verification technique that checks whether the satisfaction relation (|=) holds between a model (or implementation) and a property (or specification). We write the satisfaction relation between implementation and specification as Imp |= Spec. A model checker performs an exhaustive search over the state space of the implementation, which is usually a finite state machine, with regard to the specification. Two characteristics that make the verification of digital circuits difficult are reactiveness and concurrency. Verifying non-reactive systems, whose behavior can be modeled only with respect to their inputs and outputs (regardless of the environment), can be relatively simple. In contrast, reactive systems interact with their environment frequently, and therefore the effect of this interaction must be considered when modeling their behavior. In general, reactive systems do not terminate [28] and can be imagined as running inside an infinite loop; examples include operating systems, embedded systems and hardware systems. Concurrency means that a set of components of a system execute at the same time [14]; in other words, a concurrent system contains components that perform their tasks simultaneously. Digital systems are concurrent systems, since all of their components may run in parallel. To express the properties that the implementation is expected to satisfy, temporal logics are used. Temporal logics are capable of expressing the specification of concurrent and reactive systems, since they can express the order of events in time. Different temporal logics are used for model checking, and choosing an appropriate logic involves a few considerations [8]. The first concern is expressiveness: whether the logic can express a certain property. The second issue is the complexity of the verification task using a certain temporal logic: although a more expressive logic can express more complicated properties, it might be computationally expensive to verify a system against properties expressed in that logic. Third, the properties expressed in temporal logic should be validated against the top-level informal intention of the designer. The most commonly used temporal logics in commercial model checkers are Linear Temporal Logic (LTL), Computation Tree Logic (CTL) and CTL* [14]. Each temporal logic can be

useful for expressing the properties of certain systems. LTL expresses properties of one particular sequence of states; therefore, to express non-deterministic systems, more than one trace is needed. In LTL, there is no existential quantification over paths, which may limit the expressiveness of the language. CTL is capable of expressing non-deterministic systems in which each instant of time can have more than one possible future. This logic is very powerful for verifying hardware and communication protocols, and recently it has been applied to software systems as well. CTL is a branching-time logic, meaning that its model of time is a tree-like structure in which the future is not determined: there are different paths into the future, any of which might be the path that is actually realized. CTL* is a superset of CTL and LTL. PSL [3], which has been designed to be both formal and intuitive, is a specification language used for specifying the properties of a design in a mathematically precise way; PSL is supported by verification tools such as Averant Solidify [7] and the Cadence Incisive verification platform (Verplex) [9]. SMV (Symbolic Model Verifier) [10] is a program that checks whether an implementation, described in the SMV language, satisfies a specification described in CTL. Reactive and concurrent finite state systems (software or hardware) can be modeled in the SMV language. SMV uses OBDD-based (Ordered Binary Decision Diagram) symbolic model checking [14] to verify that the implementation obeys the specified properties. Cadence SMV [24] is an upgraded version of SMV that supports the SMV and Verilog languages; its specifications can be described using LTL, CTL or finite automata. In general, SMV can model synchronous systems, in which all assignments are done in parallel, as well as asynchronous systems, in which module executions can be interleaved. SMV is mainly suitable for the modeling and verification of control circuitry where no complex datapath is present. Examples of applications that can be verified using SMV include control circuitry in microprocessors, operating systems, embedded systems, process-control systems, financial trading systems and automated banking machines. FormalCheck, another model checker, provides a mathematical proof that a property of a design model holds [11]. It accepts the synthesizable subset of VHDL and Verilog.

As such, there is in theory no need to modify the source code of these languages for the purpose of verification. The register-transfer-level code of the model, as well as the desired specification for that model, is supplied to FormalCheck, which compiles the code and provides feedback in the event of any syntax errors; after fixing the errors, one can recompile the code through FormalCheck without the need for external software. Unlike other model checkers, which use some form of CTL or LTL to define properties, FormalCheck, for the sake of simplicity, defines each property using one of a small set of templates, called queries, each with a clear, intuitive and simple semantics. However, the simplicity of the FormalCheck query semantics reduces the flexibility that might be needed to express a property. The queries are applied to the model to determine whether they hold for it. Queries are composed of both properties and constraints: properties are the aspects of the model that one is trying to verify, and constraints can be used to specify the behavior of the environment. FormalCheck also provides state variables, which can be used to track the state of the system (both internal and external signals) and which greatly improve the expressive power of the queries. Typically, these state variables are used in the same manner as flip-flops: on the rising edge of the clock, they store the value on their inputs and make that value available on their outputs [17]. FormalCheck uses a combination of different model checking algorithms, including symbolic state enumeration (OBDDs), explicit state enumeration and auto-restrict [11]. If the model to be verified is too large, with numerous states, symbolic state enumeration might be chosen; in contrast, explicit state enumeration is faster for systems with fewer than 1000 inputs/states. In addition, the auto-restrict algorithm is used to narrow the model under study down to the portion that is expected to contain the bug. The drawback of the auto-restrict algorithm is that it may end with no error message while other portions of the model still contain bugs; hence, if auto-restrict ends with a "no bug" message, the verification must be redone using the other two algorithms. Like SMV, FormalCheck is best suited to the verification of control circuitry such as a microprocessor pipeline control unit. Overall, the main limit in model checking is the state explosion problem, which happens when

the number of states of the system under study exceeds the capacity of the memory of the machine on which the model checker is running. As a result, model checking is usually applicable only to small and medium-sized circuits.
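As a toy illustration of the satisfaction relation Imp |= Spec, the sketch below performs explicit-state checking of a simple invariant (a property that must hold in every reachable state). It is far removed from the symbolic, OBDD-based algorithms used by SMV and FormalCheck, and all names in it are ours.

    from collections import deque

    def check_invariant(initial_states, next_states, invariant):
        """Explicit-state reachability check of an invariant.

        initial_states : iterable of initial states (hashable values)
        next_states    : function state -> iterable of successor states
        invariant      : function state -> bool, the property to hold in every reachable state
        Returns (True, None) if the invariant holds, else (False, violating_state)."""
        seen, frontier = set(initial_states), deque(initial_states)
        while frontier:
            s = frontier.popleft()
            if not invariant(s):
                return False, s                      # counterexample state found
            for t in next_states(s):
                if t not in seen:
                    seen.add(t)
                    frontier.append(t)
        return True, None

    # Example: a modulo-4 counter never reaches the value 5.
    print(check_invariant({0}, lambda s: [(s + 1) % 4], lambda s: s != 5))   # (True, None)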

3.2.2 Equivalence Checking

A Boolean function is an abstract mathematical model of a corresponding combinational circuit. Two combinational circuits are considered equivalent if they demonstrate the same behavior over a sequence of inputs; in other words, two combinational circuits have the same functionality if they produce the same outputs for the same set of inputs. One way to verify the equivalence of two combinational circuits is to simulate them with all possible inputs and compare all the corresponding outputs, which is usually done with function tables that assign a value to every possible combination of inputs. However, this approach is not feasible, since the number of test cases grows exponentially with the number of inputs of the combinational circuit: for a circuit with n inputs, we need to generate 2^n test vectors to perform complete validation. This number becomes enormous for large circuits with more than 100 inputs and is therefore impractical (in terms of time and storage) in the real world. Equivalence checking is a verification method that verifies the equality of two circuits, which can be described at two different levels of abstraction (e.g. behavioral, register-transfer level or transistor-level netlist). The equivalence verification process can be divided into three main steps. In the first step, associated points in the two circuits are detected in order to match the specification and implementation; typically, the match points are the inputs, registers (flip-flops) and outputs of the circuit. In the second step, a set of compare points between the specification and implementation is identified. These compare points, typically a subset of the match points, represent the points in both the specification and the implementation that must be verified for equivalence. In the third step, the next-state function of each compare point in the specification is verified against the next-state function of the corresponding compare point in the implementation. If this comparison is successful for each compare point, then the specification and implementation are considered equivalent. If there are compare points

that are not equivalent, then the implementation error can be located by examining the logic of those compare points [27]. An equivalence checking tool translates the two given circuits into Boolean representations and then compares the two Boolean networks to each other. To perform equivalence checking, Binary Decision Diagrams (BDDs) [14] and Boolean Satisfiability (SAT) [12] are used frequently by verification CAD tool designers. With equivalence checking, most of the verification task is automated and little human interaction is required. Equivalence checking can be applied to either combinational or sequential circuits. Combinational equivalence checking can only verify the equality of two combinational circuits; therefore, it cannot verify the correctness of an implementation on which a re-timing optimization has been performed relative to the specification. Sequential equivalence checking, on the other hand, attempts to verify an implementation that has been optimized by moving modules across flip-flops. Sequential equivalence checking is still a big challenge in the formal verification field and, to date, there is no industrial tool that performs it. There are a few commercial equivalence checking tools, including Synopsys Formality and Mentor Graphics FormalPro. Formality [42] accepts a combination of various hardware description formats as input, including SystemVerilog, Verilog, VHDL, EDIF, Synopsys DB, DDC, MDB and SPICE (Formality-ESP), and FormalPro [29] accepts VHDL and Verilog. All of these tools perform combinational equivalence verification. The specification and implementation can be provided at any level of abstraction, from the behavioral level down to the gate level. To verify two sequential circuits, both circuits must have the same state variables, which means the behavior of each corresponding portion of the circuits between two flip-flops must be the same.
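The exponential blow-up of the naive function-table approach is easy to see in a toy sketch (ours; commercial tools such as Formality and FormalPro instead rely on BDDs and SAT):

    from itertools import product

    def naive_equivalent(spec, impl, n_inputs):
        """Exhaustively compare two combinational circuits with n_inputs inputs.
        spec and impl are functions from a tuple of bits to an output value.
        This needs 2**n_inputs evaluations of each circuit, so it only works for tiny n."""
        return all(spec(bits) == impl(bits) for bits in product((0, 1), repeat=n_inputs))

    # Example: two implementations of a 2-input XOR.
    spec = lambda b: b[0] ^ b[1]
    impl = lambda b: (b[0] or b[1]) and not (b[0] and b[1])
    print(naive_equivalent(spec, impl, 2))   # True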

3.2.3 Theorem Proving

Theorem proving is a mechanization of mathematical reasoning: it proves whether an implementation satisfies a specification. Both implementation and specification are expressed as formulas in a formal logic [40]. A proof system consists of a set of axioms and a set of inference rules.

ACL2 [46] is a theorem prover that supports first-order logic. Models of different computer systems, hardware or software, can be built in ACL2, and theorems about those models can then be proved with it. First-order logic is semi-decidable and, although it is unable to model time, which is essential for sequential circuits, it is used for hardware verification (e.g. in ACL2) because of its reasonably high level of automation. Higher-order logic is more expressive for hardware than first-order logic and hence is frequently used for hardware verification. Developed at the University of Cambridge, HOL [45] is a theorem prover that supports higher-order logic and partly automates the proofs; for hardware verification, HOL can be used to prove theorems directly, with higher-order logic serving as both the specification and modeling language [40]. In theorem proving, the implementation is usually first verified against a high-level model, called the reference model, and then the reference model is verified against the specification. While the other verification approaches do not benefit from the fact that hardware circuits are designed hierarchically, theorem proving can exploit the hierarchy and regularity of a design, giving the user more control. For instance, to verify a system that consists of sub-modules, a sub-module can be verified first; the verified sub-module can then be treated as a correct module (a lemma in the proof), and the rest of the system can be verified on top of it, and so on. In contrast, although some model checkers identify the modules that contain an error, knowing these modules does not, in practice, help to exploit the hierarchical nature of digital design. Theorem proving is suitable for parameterized, datapath-dominated designs. However, it requires a large amount of human interaction, and hence deep familiarity with theorem provers such as HOL or ACL2 is necessary in order to model a system and construct a proof [40][46].

3.3 Related Work

Pipelined circuits are an important part of digital systems, and verifying the correctness of such systems is a crucial task. Much research has been done on the verification of pipelined
machines. In the remainder of this section, we cover related work on the formal verification of pipelined circuits. Since our goal is to formally verify image processing circuitry implemented as a pipeline, we studied the techniques that have been proposed for verifying pipelined circuits. Microprocessor design is the major field built on pipelining, so we selected various verification methods that have been proposed for verifying pipelined microprocessors; we also cover two works that focus on the formal verification of non-microprocessor designs. Windley and Coe [34] presented the verification of the control unit of a five-stage pipelined microprocessor (UINTA), which includes data and control hazards. The authors developed specifications for different levels of abstraction, ranging from the register-transfer level to the architectural level, which represents the view of the microprocessor seen by the assembly language programmer. The specifications for the different levels of abstraction were defined using generic interpreter theory, a way of modeling with a state transition system; using the interpreter model, each level of abstraction implies the level that lies on top of it. For the sake of accuracy of the specifications and proofs, HOL was chosen as the mechanical proof system. Using symbolic execution of the pipeline, each instruction is considered in both cases: whether or not there is a stall. In this method, although the different levels of abstraction gradually reduce the complexity of specification and verification, verifying every pair of adjacent abstraction levels manually makes the verification tedious; moreover, the overall verification time depends linearly on the size of the instruction set. Burch and Dill [21] verified a pipelined implementation of the control circuitry of a subset of the DLX architecture. DLX is a 32 bit pipelined RISC processor with 5 stages: Instruction Fetch, Instruction Decode, Execute, Memory and Write Back. The subset of DLX verified in this work includes six types of instructions: store, load, unconditional jump, conditional branch, 3-register ALU instructions and ALU immediate instructions. The authors presented a new method that verifies the control circuitry of pipelined microprocessors automatically, assuming that the combinational logic in the datapath is correct. In this method, the

In this method, the HDL descriptions of the pipeline for both the specification and the implementation are translated into transition functions by a compiler using symbolic simulation. A transition function takes the current state and the inputs as its arguments and returns the next state. The specification and implementation are required to have corresponding input signals. In this work, a new concept was introduced, "flushing the pipeline", which makes it possible to study the effect of completing one instruction on the different stages of the pipeline. Flushing a pipeline means stalling the pipeline so that no new instruction enters while the in-flight instructions continue execution. Because instructions are only partially executed in the pipeline, it is difficult to compare the implementation state against the specification state. Using flushing, the effect of completing a particular instruction in the implementation can be verified against the effect of completing the same instruction in the specification.

[Figure 3.1: Commutative diagram for the Burch-Dill approach. Flushing the old implementation state QImpl and projecting the result yields the old specification state QSpec; taking one step in the implementation (to Q′Impl), then flushing and projecting, is compared against taking one step in the specification (to Q′Spec).]

Figure 3.1 shows the commuting diagram of the proposed verification method. Starting from the old implementation state (QImpl), flushing the pipeline produces the "flushed old implementation state". By removing all parts of the flushed old implementation state that are not visible to the programmer (using a function called projection), the old specification state (QSpec) is obtained. QImpl and QSpec are two points of the implementation and specification that match each other. An arbitrary input I to both the implementation and the specification changes their states from the old states QImpl and QSpec to the new states Q′Impl and Q′Spec, respectively. The implementation satisfies the specification if and only if Q′Spec matches Q′Impl. To check whether the two new states match, Q′Impl is flushed and projected (using a projection function) and the result is compared to Q′Spec.
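To make the check concrete, here is a small sketch of the Burch-Dill commuting diagram in Python (our own toy rendering, not the tool of [21]; impl_step, spec_step, project and the bubble input are stand-in names, and a real implementation checks the equality symbolically for all states and inputs rather than for one concrete case):

# Sketch of the Burch-Dill commuting-diagram check (illustrative only).
# A real flow extracts impl_step/spec_step symbolically from the HDL and
# discharges the equality with a decision procedure; here we simply phrase
# the obligation for one concrete start state and input.

def flush(state, impl_step, bubble, depth):
    """Stall the pipeline: feed 'bubble' inputs until all in-flight
    instructions have completed (depth = number of pipeline stages)."""
    for _ in range(depth):
        state = impl_step(state, bubble)
    return state

def burch_dill_holds(q_impl, inp, impl_step, spec_step, project, bubble, depth):
    # Upper path of the diagram: one implementation step, then flush + project.
    upper = project(flush(impl_step(q_impl, inp), impl_step, bubble, depth))
    # Lower path: flush + project first, then one specification step.
    lower = spec_step(project(flush(q_impl, impl_step, bubble, depth)), inp)
    return upper == lower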

This method automates the verification task and reduces human intervention. Moreover, since flushing makes it possible to study the effect of a single instruction on the visible state of the pipeline, complex pipelines can be verified. However, as the complexity of the processor increases (e.g., for out-of-order processors), the number of cases to be checked and the amount of flushing grow impractically; in other words, flushing the pipeline is computationally expensive.

Sawada and Hunt [22] verified the FM9801 microprocessor using the ACL2 theorem prover. The verified processor is an out-of-order processor with features such as speculative execution, precise handling of internal exceptions and external interrupts, and supervisor/user modes. The processor is defined in ACL2 at two levels: ISA (Instruction Set Architecture) and MA (Microarchitecture). The ISA level executes instructions sequentially, and the MA level models the pipeline of the processor with clock-cycle accuracy. The two initial matching points are MA0 and ISA0: MA0 is obtained by flushing the initial state of the MA, and ISA0 is obtained by applying the projection function to MA0. To obtain the next two arbitrary matching points (MAn and ISAm), two functions are defined: ISA-step(ISA, intr), which returns the ISA state after executing one instruction with external interrupt signal intr, and MA-step(MA, sigs), which returns the MA state after one clock cycle of pipeline execution with external signals sigs. By calling MA-step(MA, sigs) n times and ISA-step(ISA, intr) m times, the next two matching points (MAn and ISAm) are obtained. To verify that these two points match, MAn is projected and the result is compared against ISAm. The verification found 14 bugs in the machine design that had not been detected by simulation. Nineteen properties were defined that had to be verified, which made the verification task very intensive. Although the authors verified a complex out-of-order pipeline, because they split the job into smaller pieces the verification technique seemed to scale well with the size of the ISA and MA scripts.

Hosabettu et al. [39] introduced completion functions in order to decompose the verification task of complex pipelines.

In this method, one abstraction function is provided by the user for every unfinished instruction, and then the functions are composed in program order. Each function specifies the desired effect of completing the instruction on the visible registers of the pipeline. We explain this method in more detail in section 3.4.

Mishra et al. [36] presented a top-down validation technique using symbolic simulation to verify the memory management unit of a PowerPC-compliant microprocessor. In this work, using the architecture specification documentation, which was provided in English, a set of properties is defined in Verilog to ensure that the register-transfer-level implementation satisfies the specification. The register-transfer-level design is translated into a Boolean model, and the properties are converted into a state machine. Afterward, the Boolean model of the register-transfer-level design and the state machine of the properties are fed to a symbolic simulator (Versys2). Versys2 checks whether the properties are satisfied by the register-transfer-level design and produces a counterexample if there is a mismatch. In contrast to many other approaches, which apply a bottom-up method to extract the specification, Mishra et al. employed a top-down approach to develop the properties used to validate the design. In general, symbolic simulation allows the behaviour of the system to be examined at a particular clock cycle, which makes it possible to verify data-intensive circuitry rather than only control circuitry.

In another work, Mishra et al. [35] proposed an FSM-based modelling of pipelined microprocessors that uses a set of properties to verify the correctness of the pipeline. In this approach, the processor specification is defined in an Architecture Description Language (ADL) (e.g., EXPRESSION ADL), based on the documentation that the architects provide, and the FSM model is automatically generated from the ADL model. Afterward, to ensure that the FSM generated from the ADL model captures the behaviour of the processor described in the architects' documentation, several properties such as finiteness, determinism and in-order execution are defined. These properties are automatically applied to the FSM model of the processor to verify the correctness of the in-order execution of the processor. Using this method, the controller of a single-issue DLX processor, an in-order processor with a fragmented pipeline and multi-cycle functional units, was verified.

Aagaard et al. [26] presented a framework for categorizing the different correctness statements of safety properties used for the formal verification of microprocessors. It is assumed that the implementation and the specification are described as FSMs. A correctness statement is often described as: "every trace of external observations generated by the implementation can also be generated by the specification". To categorize correctness statements, the framework includes four parameters: alignment, match, implementation execution and specification execution. Alignment specifies which states of the implementation and specification should match; seven different alignments, described in the framework, have been found in use in microprocessor verification. The classic alignment (pointwise), which compares every step of the implementation and specification, is in fact the commuting diagram; more complicated situations such as stall, flush and out-of-order execution are covered by other alignment definitions. Match defines the relation between aligned implementation and specification states. The match relations include the abstraction match, which uses a function to map an implementation state to a visible specification state; the flushing match, which uses flushing to match an implementation state to a visible specification state; the equality match, which requires the implementation state and specification state to be externally equivalent; and finally the refinement match, which requires that the abstraction function preserve the externally visible part of the implementation. The equality and refinement matches become identical when all of the specification's states are externally visible. For the last two parameters of the framework, the executions, both deterministic and non-deterministic situations are considered. By selecting different combinations of the four parameters, the general form of the correctness statement is defined as <alignment> <match> <impl. execution> <spec. execution>, and from this general definition its mathematical formulation has been defined.
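For illustration only (this is one familiar instantiation written in our own notation, not the general formulation given in [26]): choosing the pointwise alignment, the abstraction match and deterministic executions yields the usual single-step commuting-diagram statement

\[ \forall\, q_I,\, i:\;\; \mathrm{abs}\big(\delta_I(q_I, i)\big) \;=\; \delta_S\big(\mathrm{abs}(q_I),\, i\big), \]

where \(\delta_I\) and \(\delta_S\) are the implementation and specification transition functions and \(\mathrm{abs}\) maps implementation states to specification states.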

A literature survey by Aagaard et al. [26] shows that the proposed framework covers the various correctness statements used to verify out-of-order and superscalar microprocessors. The framework leads to the observation that a flush-based alignment with an equality match is easiest to use for pipeline verification. However, due to the capacity limits of verifiers, applying simple flushing to an out-of-order processor is not practical, and it is difficult to align the implementation states with the specification for machines that retire instructions out-of-order. In general, stalls complicate the alignment process, and the matching process is complicated for processors that handle exceptions. It was found that synchronizing the implementation and specification machines at instruction retirement, rather than at instruction issue, makes it easier to handle exceptions. Finally, the paper offers a few observations on how to choose a correctness statement for verifying different types of processors.

Aagaard [25] presented a formal model and a correctness statement that are based on pipeline stages, parcels and the three types of pipeline hazards: structural, control and data hazards. The correctness statement, called PipeOk, contains thirteen correctness obligations, each of which describes a single type of behaviour. PipeOk decomposes the standard Burch-Dill flushing correctness statement into separate correctness obligations related to structural, control and data hazards, datapath functionality and flushing the pipeline. The model presents a description of a pipeline, the "parcel view", which focusses on the transfer of parcels between stages and on the data storage operations (reads and writes) that the parcels perform. The model is based on the parcel view of a pipeline and on consistency between the specification and the parcel view of the pipeline: a parcel view of a pipeline is consistent with a specification if every signal or register that is read in the specification has at least one corresponding signal or register in the implementation of the pipeline and all corresponding signals have the same value. The first two of the three structural hazard correctness obligations guarantee a one-to-one mapping between parcels that enter the pipeline and should exit and those that do exit. The third obligation relates the flow of parcels inside the pipeline to the entrance and exit of parcels that are externally visible; this allows the pipeline to be treated as a black box for hierarchical verification. Six data hazard correctness obligations guarantee that reads and writes in the implementation take place in the correct sequence. There is one datapath correctness obligation, which ensures that if every read operation a parcel performs reads the correct data, then every write operation the parcel performs produces the correct data.

The control hazard correctness obligations are embedded in the structural and data hazard correctness obligations. Together, these ten obligations guarantee that every parcel that enters the pipeline and must exit generates the correct result. To guarantee the correctness of flushing, three further obligations are defined, which cause PipeOk to imply Burch-Dill flushing. PipeOk attempts to increase verification capacity by decomposing a verification task into smaller subparts. It also aims to make the verification task intuitive to both verification engineers and design engineers. However, it requires verification engineers to have a deep understanding of pipeline-specific characteristics.

Although the main focus of formal verification research has been microprocessors, non-microprocessor pipelined circuits have also received attention from formal verification research groups. Non-microprocessor pipelined circuits, which are widely used in many applications, are complex enough to require formal verification in order to minimize the probability of uncovered design bugs. Narasimhan and Vemuri [32] presented a set of CTL properties that are crucial for the correctness of a design controller synthesized from behavioural VHDL code. The controller consists of a set of concurrent FSMs, each of which is assigned to a module in the behavioural description of the design. The proposed properties guarantee that the FSMs have the same behaviour as the behavioural description with respect to the environment. Although the properties are necessary for any design to satisfy, they can be made part of the synthesis program and hence hidden from the register-transfer-level designer; in other words, they do not contribute to the verification of the functionality of a design developed at the register-transfer level.

Jang et al. at Motorola [18] reported a formal verification experience with a safety chip (FIRE) used in cars' control systems, using a BDD-based model checker. FIRE is a complicated car safety chip which, in case of an accident, raises the appropriate signals to make the airbags pop up or the seat belts tighten. FIRE is an interface between a microcontroller and analog safety devices. The microcontroller communicates with FIRE by writing to or reading from FIRE's internal registers using a complicated protocol.

When the car is involved in an accident, the microcontroller and another module produce a sequence of signals that make FIRE send the "crash" signal to the analog safety devices. The most important property that FIRE must satisfy is that it sends the "crash" signal to the analog devices if and only if the microcontroller has already sent the appropriate signals to FIRE; if the crash signal goes high while the car has not crashed, an accident is almost inevitable. There were three problems with defining such a property. First, it was difficult to extract the exact specification of the design from the specification documents written by the designers. Second, the properties to be verified were too complex to be defined in CTL using only the interface signals; much more knowledge about the internal signals of the system was required. Finally, the well-known state-space explosion problem was the third hurdle of the verification task. To overcome the first and second problems, the verification team had to study the Verilog model of FIRE in detail; the third problem was tackled by dividing the main proof into smaller obligations. The verification team did not develop a proof that the local obligations are sufficient to verify the entire system; however, they found three bugs, which validated their method. In many cases the verification team had to abstract away parts of the system that did not seem to influence the verification, which was reported to be a tedious manual job. Another problem was that FIRE used two asynchronous clock signals; the two clocks were synchronized, since the model checker accepted only one global clock, and although synchronizing the two clocks might cause some erroneous results, the designers of the system felt this was a safe assumption. In the end, 76 CTL properties related to different parts of the system were defined and verified. Three bugs were discovered, one of which was critical: it was related to the initial value of a register in the register file of FIRE and could cause a sequence of actions leading to the firing of the "crash" signal. The authors conclude that the best people to write precise properties for a design and to use the model checker efficiently are the designers themselves; in other words, the design cannot be treated as a black box by the verification engineers.
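For illustration only — the signal names here are hypothetical and this is not one of the actual FIRE properties — a safety requirement of the flavour described above could be written in CTL as

\[ AG\,(\mathit{crash\_out} \rightarrow \mathit{arm\_sequence\_done}), \]

stating that in every reachable state the crash output may be asserted only if the arming sequence from the microcontroller has completed.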

The main obstacle in verifying pipelined processors is the nature of pipelining itself, which complicates the verification task. The main focus of research on the formal verification of microprocessors is to reduce the complexity of the verification task from both the human and the machine point of view. From the human perspective, the goal is to keep the amount of human interaction as low as possible by performing the verification task more automatically. From the machine point of view, the aim is to explore methods that require less machine capacity (CPU time and memory), usually by splitting the verification task into smaller modules. As the related work shows, formal verification research groups mainly focus on the verification of pipelined microprocessors. This seems reasonable, since processor design relies heavily on pipelining. However, many digital circuits that are not microprocessors are also based on pipelining; pipelined image processing circuitry is one such type of design and is used in many applications. In the next section, we explain a verification technique that can be applied to pipelined image processing circuitry.

3.4 Combining Completion Functions with Equivalence Checking

In this section, we first describe a formal verification method that we have developed in a group effort [27]. Afterward, we discuss a case study (an edge detector) that was verified using this method.

3.4.1 Introduction

Although much research has been performed on formal verification techniques and tools, there are still hurdles to using formal verification methods and tools in the real world. First, much of the work that has been carried out (and is still being done) in this field relates to the verification of microprocessors. Although microprocessors are an important part of the digital world, there are many other systems that differ from microprocessors, and the techniques and tools created for the formal verification of microprocessors do not directly address the verification challenges of these other systems.

Second, the techniques and tools developed for microprocessors generally limit verifiers to working with high-level models of the design rather than at the register-transfer level. For many systems, verifying a high-level model does not guarantee that the actual implementation is bug-free; many bugs are introduced while converting a high-level model into an implementation.

Among the various formal verification strategies that have been developed for the verification of pipelined processors, the completion functions approach [39] is one of the most intuitive and compositional. This method decomposes the pipeline verification task into smaller modules (completion functions). We have combined the completion functions approach with combinational equivalence checking to develop a method for verifying a pipelined implementation at the register-transfer level against its specification. The goal is to verify a pipelined design at the register-transfer level rather than at higher levels of abstraction: hardware optimization techniques are applied to the implementation at the register-transfer level, and hence many bugs are introduced at this level; such bugs are unlikely to be caught by verifying a high-level model of the circuit. We have chosen combinational equivalence checking as our verification tool because of its high capacity and high degree of automation. Combinational equivalence checking compares the next-state equations of signals from two circuits, which are determined only by the combinational circuitry driving the signals; as a result, combinational equivalence verification cannot cross flip-flops when comparing two circuits. Although this is a restriction, our approach benefits from completion functions, which verify the pipeline one stage at a time. This reduces the complexity of the computations significantly and makes pipeline verification practical.
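As a sketch of what this means in practice (our own toy example with made-up signal functions; real combinational equivalence checkers reason symbolically with BDDs or SAT rather than by enumeration), comparing two next-state equations amounts to checking that the combinational logic driving a flip-flop input computes the same value for every input and current-state combination:

from itertools import product

def equivalent_next_state(next_state_a, next_state_b, n_inputs, n_state_bits):
    """Compare two next-state functions over all input/state combinations.

    next_state_a / next_state_b map (inputs, state) to the value driving a
    flip-flop input; both take tuples of booleans.  A real equivalence checker
    does this symbolically instead of enumerating."""
    for inputs in product([False, True], repeat=n_inputs):
        for state in product([False, True], repeat=n_state_bits):
            if next_state_a(inputs, state) != next_state_b(inputs, state):
                return False, (inputs, state)   # counterexample
    return True, None

# Hypothetical example: two ways of computing the same next-state bit.
ref  = lambda i, s: (i[0] and s[0]) or i[1]
impl = lambda i, s: not ((not i[0] or not s[0]) and not i[1])   # De Morgan form

ok, cex = equivalent_next_state(ref, impl, n_inputs=2, n_state_bits=1)
print(ok)   # True: the two next-state equations are equivalent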

3.4.2 Background on Completion Functions

This section provides background on completion functions and the way this method decomposes the verification of in-order pipelines. One completion function is written by the verifier for each in-flight instruction in the pipeline.

Since there is one in-flight instruction in each pipeline stage, there will be one completion function for each pipeline stage. The effect of applying a completion function to a pipeline stage is to complete the partially executed instruction that resides in that stage and hence to update the architectural registers that are affected by completing the instruction at that stage. If all the completion functions of a pipeline are executed in order from the bottom of the pipeline upward, starting with the last stage, the net effect is the same as flushing the pipeline in the Burch-Dill approach.

[Figure 3.2: Simple pipeline with flushing and completion functions commuting diagram. (a) Pipeline: three stage registers (S1, S2, S3) with stage logic f1, f2, f3 and three architectural registers (R1, R2, R3). (b) Flushing: the Burch-Dill commuting diagram. (c) Completion functions: the stage-by-stage commuting diagram using completion functions C1, C2 and C3.]

Figure 3.2 shows the Burch-Dill flushing and completion functions diagrams for a simple pipeline. The pipeline has three stage registers (S1, S2 and S3) and three architectural registers (R1, R2 and R3). There are three in-flight instructions (A, B and C) in the pipeline in the starting state (qi), and a new instruction (D) enters the pipeline in the next state of the implementation and specification. In the Burch-Dill diagram (Figure 3.2.b), when the pipeline is flushed, bubbles enter the pipeline. The pipeline continues to be flushed until all the in-flight instructions are completely executed and the architectural registers are updated accordingly. At this point, the stage registers of both the current state (qi) and the next state (qi′) are filled with bubbles. However, the architectural registers of the current state contain the values generated by completing instruction C, and the architectural registers of the next state contain the result of executing instruction D. Applying the abstraction function to both qi and qi′ yields qs and qs′, the current state and the next state of the pipeline specification. According to the Burch-Dill approach, to verify the correctness of the pipeline, it should be proved that taking one step in the implementation from qi to qi′ and then flushing the pipeline to obtain qs′ is equal to flushing the pipeline while it is in qi to get qs and then taking one step in the specification to obtain qs′.

On the other hand, as Figure 3.2.c shows, using completion functions, at each step one stage is verified against its completion function. The commuting diagram for completion functions enables a stage-by-stage decomposition into multiple verification obligations. As a result, there is one obligation for each pipeline stage and one obligation between the current state and the next state of the specification. For our example in Figure 3.2.a, there are four verification obligations, represented by the dotted lines. Working from bottom to top, each obligation verifies one stage of the pipeline; the shaded cells are the subset of the architectural registers that are involved in the verification obligation of each stage. We start with the third (last) stage in qi. Completing instruction A, which means applying completion function C3 to the third stage of the pipeline, updates architectural register R3. On the other hand, taking one step in the implementation from qi to qi′ completes instruction A and updates R3 accordingly. At this step it should be proved that in both cases R3 receives identical values (the first obligation). In the next step, completion function C2 is applied to the pipeline in qi; it reads the value of R3 that has been updated by C3 and updates R2 and R3. Accordingly, taking one step in the implementation from qi to qi′ executes instruction B and updates R2, and applying completion function C3 to the third stage of the pipeline in qi′ completes instruction B, which now resides in S3, and updates R3. As a result, in both cases R2 and R3 are updated, and the equality of these registers in both cases guarantees the correctness of the second pipeline stage (the second obligation). In other words, it is proved that composing completion functions C2 and C3 is equivalent, with respect to R2 and R3, to taking an implementation step and executing C3. Working our way up the commuting diagram incrementally, eventually all the pipeline stages are verified. For the fourth obligation, the specification is compared against the completion function of the first stage.

The difference between the Burch-Dill flushing scheme and completion functions is that in the Burch-Dill approach the pipeline must be completely flushed in order to be verified, whereas in the completion functions approach each stage is verified against its completion function one at a time. In other words, the Burch-Dill commuting diagram is a monolithic verification obligation, while the completion functions approach consists of several obligations. Being a monolithic obligation makes the Burch-Dill scheme impractical for even moderately large and complex pipelines, since flushing is computationally expensive to perform as a single obligation.
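To make the decomposition concrete, the following sketch (our own toy rendering of the three-stage example, with dictionary-based states and hypothetical completion and step functions) phrases obligation k as: applying completion functions C3, ..., Ck to the current state yields the same values for registers Rk, ..., R3 as taking one implementation step and then applying C3, ..., Ck+1.

# Toy sketch of the completion-functions obligations for the three-stage
# example (illustrative only).  A state is a dict holding stage registers
# S1..S3 and architectural registers R1..R3; completion[k](state) returns the
# architectural-register updates produced by completing the instruction of
# stage k, assuming stages k+1..3 have already been completed.

def apply_completions(state, completion, first_stage, last_stage=3):
    """Complete stages last_stage, last_stage-1, ..., first_stage, in order."""
    state = dict(state)
    for k in range(last_stage, first_stage - 1, -1):
        state.update(completion[k](state))
    return state

def obligation_holds(k, state, impl_step, completion):
    """Obligation k of the commuting diagram (k = 3 is the first one proved)."""
    registers = ["R1", "R2", "R3"]
    # Specification side: complete stages 3, ..., k in the current state.
    spec_side = apply_completions(state, completion, first_stage=k)
    # Implementation side: take one step, then complete stages 3, ..., k+1.
    impl_side = apply_completions(impl_step(state), completion, first_stage=k + 1)
    # Only the registers R_k .. R_3 are compared for this obligation.
    return all(spec_side[r] == impl_side[r] for r in registers[k - 1:])

The remaining (fourth) obligation, comparing one step of the specification against the completion function of the first stage, is phrased analogously.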

3.4.3 Approach

In this section we will show how we use completion functions to perform the verification of pipelines using combinational equivalence checking.

[Figure 3.3: Third step of simple example. (a) Specification: completion functions C1, C2 and C3 applied to stage registers S1, S2, S3 and architectural registers R1, R2, R3. (b) Implementation: one pipeline step (f1, f2, f3) followed by completion functions C2 and C3.]

In Figure 3.3, we use the third obligation of Figure 3.2.c to illustrate the combination of completion functions and equivalence verification. We create two circuits, one for the left side of the completion functions diagram (the specification) and one for the right side (the implementation). In the specification (Figure 3.3.a), each completion function Ci reads the corresponding stage register Si and, through the downstream completion functions, reads the architectural registers Ri, Ri+1, ..., Rn; the output of every completion function updates all the downstream architectural registers. In the implementation (Figure 3.3.b), since it is assumed that one step has been taken, each completion function reads the next-state value of the stage registers, i.e., the input to the stage registers. The large gray polygons represent the second verification obligation and illustrate how we take advantage of the compositional nature of completion functions. The first and second obligations have proved that, in both the specification and the implementation, R2′ and R3′′ are equal. Therefore, we only need to verify that applying completion function C1 to the first stage of the pipeline (f1) on the specification side is equivalent to taking one step on the implementation side and applying completion function C2 to the second stage of the pipeline (f2); in other words, that R1′ receives identical values in both cases. The completion functions are implemented as combinational logic, and the equivalence checker is asked to compare the output values of the architectural registers (R1′, R2′′ and R3′′′).

The presented technique is capable of handling stalls, bypass paths and speculative execution, which are quite unique to each pipeline. However, since our case study (the Sobel edge detector) is a linear pipeline with no stalls or bypass paths, we do not address these issues here. In addition, our case study does not have any architectural registers, so we also do not consider how the verification technique can handle memory.

3.4.4 Case Study: Sobel Edge Detector

In this section we describe how we applied our verification technique to a case study. Our case study is the Sobel edge detector circuit, which, as discussed in section 2.4.1, has been implemented at the register-transfer level in VHDL. Since the completion function of each pipeline stage only updates the architectural registers of the pipeline, it can be considered the specification of the stage, described as combinational logic.

Therefore, running the completion function of a stage means that the architectural registers are read by the specification of the stage, the specification is executed and, as a result, the architectural registers are updated. For a linear pipeline (the example in Figure 3.2.a) that has no stalls or bypass paths, our verification technique can be applied as follows. The implementation of the third stage (f3) is verified against its specification by combinational equivalence checking. If the verification passes, the implementation of the second stage (f2) is combined with the specification of f3 and verified against the combined specification of f2 and f3. Once this is complete, the implementation of f1 is combined with the specification of f2 and f3, and the result is verified against the combined specification of f1, f2 and f3. The verification task is thus performed gradually, stage by stage, working upward. Since the edge detector pipeline (Figure 2.11) is linear and has no stalls or bypass paths, it was quite straightforward to apply completion functions and equivalence checking to the circuit and verify it. We started by verifying the implementation of the Magnitude stage against its specification. Once this was completed, we combined the implementation of the Max4 stage with the specification of the Magnitude stage and verified the result against the combined specification of the Magnitude and Max4 stages. We continued verifying the pipeline stage by stage, working upward, until all the stages were verified. Despite intensive simulation of the Sobel circuit, our verification found two bugs in the design, both of which were corner cases. The first bug was a synchronization problem between the valid bits for the different directions of edges, and the second was due to an incorrect optimization, which led to the wrong assumption that two signals would always have the same value; as a result, the implementation checked only one signal.
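A sketch of this stage-by-stage flow for a linear pipeline is shown below (the function names are hypothetical, check_equivalent stands in for the combinational equivalence checker, and the stage lists are ordered from the first stage to the last):

# Illustrative sketch of the upward, stage-by-stage verification flow for a
# linear pipeline with no stalls or bypass paths (names are hypothetical).

def compose(f, g):
    """Feed the output of f into g, as when the implementation of a stage is
    combined with the specifications of the downstream stages."""
    return lambda x: g(f(x))

def verify_linear_pipeline(impl_stages, spec_stages, check_equivalent):
    """impl_stages / spec_stages are ordered first stage ... last stage."""
    n = len(impl_stages)
    # Last stage: its implementation against its own specification.
    combined_spec = spec_stages[-1]
    if not check_equivalent(impl_stages[-1], combined_spec):
        return False
    # Work upward: the implementation of stage i combined with the downstream
    # specification, compared against the combined specification of stages i..n.
    for i in range(n - 2, -1, -1):
        downstream_spec = combined_spec
        combined_spec = compose(spec_stages[i], downstream_spec)
        if not check_equivalent(compose(impl_stages[i], downstream_spec),
                                combined_spec):
            return False
    return True

Each call to check_equivalent corresponds to one verification obligation of the commuting diagram in Figure 3.2.c.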

3.5 Future Work

As future work, we will extend the verification method described in section 3.4 to make it more usable for verifying pipelined image processing circuitry. We will investigate whether our technique is applicable to colour image processing circuitry, and we will also study whether the size of the image affects our technique.

As a potential future work, creating a general formal verification method that is applicable to the result reuse and result prediction techniques can be considered. Ideally, the verification technique would be developed such that it applies to these two optimization techniques (result reuse and result prediction) regardless of where they are implemented (i.e., microprocessors or image processing circuitry).

Chapter 4

Conclusion

This research proposal presents two design optimization techniques (result reuse and result prediction) that combine advanced data flow techniques of microprocessor design (instruction reuse and value prediction) with the locality of image data to improve the performance of image processing circuitry. Despite the fact that instruction reuse and value prediction have not yet been implemented in real microprocessor designs, the behavioural simulations of our proposed techniques (result reuse and result prediction) show great potential for improving the performance of real-time image processing. We propose different designs for implementing the two techniques, including 2-wide, 3-wide and n-wide superscalar pipelines and a dynamic design that maintains a constant throughput by automatically adjusting the precision. To verify the correctness of the proposed design optimization techniques, we propose to extend a verification technique that we developed in a group effort. The verification technique combines completion functions with combinational equivalence checking to verify pipelined circuitry at the register-transfer level. We will extend this technique such that it covers the pipeline hazards that are unique to image processing circuitry. The schedule for the PhD research consists of 28 months, or 7 academic terms, as shown in Table 4.1.

Table 4.1: Proposal Scheduling

Academic Term    | Task
Spring 2005      | - Design and implement result reuse as a 2-wide pipeline at the register-transfer-level
                 | - Modify detailedness algorithm
                 | - Develop a mathematical description of precision
September 2005   | - Extend completion functions method such that it is applicable to image processing circuitry implemented using result reuse technique
                 | - Investigate different case studies for result reuse
Winter 2006      | - Implement result reuse as a 3(n)-wide pipeline
                 | - Design and implement result prediction at the register-transfer-level
Spring 2006      | - Extend completion functions method such that it is applicable to image processing circuitry implemented using result prediction technique
September 2006   | - Design and verification of a dynamic design that maintains a constant throughput by adjusting the precision automatically
Winter 2007      | - Writing
Spring 2007      | - Writing

As success criteria, we expect to implement our proposed design techniques at the register-transfer level as 2-wide and 3-wide (or n-wide, as a general design) superscalar pipelines with extra hardware that is significantly less than the hardware needed to duplicate or triplicate the original hardware. We also expect to obtain speedups of up to 2 and 3 with the 2-wide and 3-wide pipelines, respectively, with respect to the classic implementation of spatial domain algorithms. If our proposed techniques do not cover all the spatial domain algorithms of image processing, we will identify a subset of spatial domain algorithms on which our techniques work well. Implementing a dynamic design that maintains a constant throughput by automatically adjusting the precision might be a challenging task, since it is unknown to us whether we will face limitations that prevent us from increasing the clock speed. To verify an image processing circuit that uses the result reuse technique, it seems necessary to distinguish between accuracy and precision. We will also extend the completion functions and equivalence checking technique in order to verify our implementations. If the completion functions verification technique does not cover all of the hazards in image processing circuitry, our alternative tactic is to extend the PipeOk verification methodology developed by Aagaard. This method, which decomposes the verification of pipelined circuits, reduces the effort needed to verify complex pipelined designs. Because this approach is a general methodology, applying it to non-microprocessor circuitry seems feasible.

Bibliography

[1] A. Sodani and G. S. Sohi. Dynamic Instruction Reuse. In International Symposium on Computer Architecture (ISCA), pages 194-205, June 1997.
[2] A. Sodani and G. Sohi. Understanding the Difference between Value Prediction and Instruction Reuse. In International Symposium on Microarchitecture, pages 205-215, December 1998.
[3] Accellera. Property Specification Language Reference Manual, June 2004.
[4] Altera Corporation. Using APEX 20KE CAM for Fast Search Applications, August 1999.
[5] Altera Corporation. Implementing High-Speed Search Applications with Altera CAM, July 2001.
[6] Altera Corporation. APEX 20K Programmable Logic Device Family, March 2004.
[7] Averant. http://www.averant.com/products.htm/.
[8] C. Kern and M. R. Greenstreet. Formal Verification in Hardware Design: A Survey. ACM Transactions on Design Automation of Electronic Systems, 4(2):123-193, April 1999.
[9] Cadence. http://www.cadence.com/products/functional_ver/.
[10] E. M. Clarke. SMV. http://www-2.cs.cmu.edu/~modelcheck/smv.html.
[11] Department of Electrical and Computer Engineering, Concordia University, Montreal, Quebec, Canada. Hands-on Manual to FormalCheck Version 2.3, May 2000.

[12] E. Goldberg, M. R. Prasad, and R. K. Brayton. Using SAT for Combinational Equivalence Checking. In Proceedings of the Conference on Design, Automation and Test in Europe, pages 114-121. IEEE Press, 2001.
[13] E. Jamro and K. Wiatr. Convolution Operation Implementation in FPGA Structures for Real-Time Image Processing. In Second International Symposium on Image and Signal Processing and Analysis, pages 417-422, June 2001.
[14] E. M. Clarke, O. Grumberg, and D. A. Peled. Model Checking. The MIT Press, fourth edition, 2002.
[15] Intel Corporation. Pentium Processor User's Manual, 1993.
[16] J. A. Boluda, F. Pardo, F. Blasco, and J. Pelechano. A Pipelined Reconfigurable Architecture for Visual-Based Navigation. In EUROMICRO Conference, volume 1, pages 71-74, September 1999.
[17] J. Higgins and F. Khalvati. Formal Verification of an Instruction Pipeline Framework Using FormalCheck. Technical report, Department of Electrical and Computer Engineering, University of Waterloo, 2002.
[18] J. Jang, S. Qadeer, M. Kaufmann, and C. Pixley. Formal Verification of FIRE: a Case Study. In Proceedings of the 34th Annual Conference on Design Automation, pages 173-177. ACM Press, 1997.
[19] J. L. Brelet. Using Block RAM for High Performance Read/Write CAMs. Xilinx, May 2000.
[20] J. P. Shen and M. H. Lipasti. Modern Processor Design. McGraw-Hill, 2004.
[21] J. R. Burch and D. L. Dill. Automatic Verification of Pipelined Microprocessor Control. In Proceedings of the 6th International Conference on Computer-Aided Verification, volume 818, pages 68-80, 1994.

[22] J. Sawada and W. A. Hunt, Jr. Results of the Verification of a Complex Pipelined Machine Model. In L. Pierre and T. Kropf, editors, CHARME'99, volume 1703 of LNCS, pages 313-316. Springer-Verlag, 1999.
[23] K. Cater, A. Chalmers, and G. Ward. Detail to Attention: Exploiting Visual Tasks for Selective Rendering. In P. Christensen and D. Cohen-Or, editors, Eurographics Symposium on Rendering, pages 270-280, June 2003.
[24] Cadence Berkeley Laboratory. SMV. http://www-cad.eecs.berkeley.edu/~kenmcmil/smv.
[25] M. D. Aagaard. A Hazards-based Correctness Statement for Pipelined Circuits. In CHARME, pages 66-80, 2003.
[26] M. D. Aagaard, B. Cook, N. A. Day, and R. B. Jones. A Framework for Superscalar Microprocessor Correctness Statements. International Journal on Software Tools for Technology Transfer, 2144:298-312, December 2003.
[27] M. D. Aagaard, V. C. Ciubotariu, J. T. Higgins, and F. Khalvati. Combining Equivalence Verification and Completion Functions. In A. Hu and A. Martin, editors, FMCAD, volume 3312 of Lecture Notes in Computer Science, pages 98-112. Springer, 2004.
[28] M. R. A. Huth and M. D. Ryan. Logic in Computer Science. First edition, 2002.
[29] MentorGraphics. http://www.mentor.com/products/fv/formal_verification/formal_pro/index.cfm/.
[30] N. Day. Computer-Aided Verification Coursenotes. University of Waterloo, Waterloo, Ontario, Canada, 2004.
[31] N. Mokhoff. Intel, Motorola Report Formal Verification Gains. EETimes, June 2001.
[32] N. Narasimhan and R. Vemuri. Specification of Control Properties for Verification of Synthesized VHDL Designs. In M. Srivas and A. Camilleri, editors, Formal Methods in Computer-Aided Design, First International Conference, volume 1166, pages 327-345. Springer, November 1996.

[33] P. Hsiao, C. Hua, and C. Lin. A Novel Architectural Implementation of Pipelined Thinning Algorithm. In IEEE International Symposium on Circuits and Systems, volume 2, pages 593-596, 2004.
[34] P. J. Windley and M. Coe. A Correctness Model for Pipelined Microprocessors. In R. Kumar and T. Kropf, editors, Proc. 2nd International Conference on Theorem Provers in Circuit Design (TPCD94), volume 901, pages 32-51, 1994.
[35] P. Mishra, N. Dutt, A. Nicolau, and H. Tomiyama. Design, Automation and Test in Europe Conference and Exhibition. In Proceedings of the Conference on Design, Automation and Test in Europe, page 36. IEEE Computer Society, 2002.
[36] P. Mishra, N. Krishnamurthy, N. Dutt, and M. Abadir. A Property Checking Approach to Microprocessor Verification using Symbolic Simulation. In Proceedings of Microprocessor Test and Verification (MTV), June 2002.
[37] R. C. Gonzalez and R. E. Woods. Digital Image Processing. Prentice Hall, second edition, 2002.
[38] R. Drechsler. Towards Formal Verification on the System Level. In Proceedings of the IEEE International Workshop on Rapid System Prototyping, pages 2-5, 2004.
[39] R. Hosabettu, M. Srivas, and G. Gopalakrishnan. Decomposing the Proof of Correctness of Pipelined Microprocessors. In Proceedings of the 10th International Conference on Computer Aided Verification, volume 1427 of Lecture Notes in Computer Science, pages 122-134. Springer-Verlag, 1998.
[40] S. Tahar, E. Cerny, and X. Song. Formal Verification of Systems. Technical report, Department of Electrical and Computer Engineering, Concordia University, 2000.
[41] S. Y. Eun and M. H. Sunwoo. An Efficient 2-D Convolver for Real-Time Image Processing. In IEEE International Symposium on Circuits and Systems, volume 2, pages 429-432, 1998.

[42] Synopsys Inc. Synopsys Formality Datasheet, 2004.
[43] T. Coe, T. Mathisen, C. Moler, and V. Pratt. Computational Aspects of the Pentium Affair. IEEE Computational Science and Engineering, 2(1):18-30, 1995.
[44] T. Kropf. Introduction to Formal Hardware Verification. Springer, 1999.
[45] University of Cambridge. HOL. http://www.cl.cam.ac.uk/Research/HVG/HOL/.
[46] University of Texas. ACL2. http://www.cs.utexas.edu/users/moore/acl2/.
[47] Y. Huang and T. Chang. A Fuzzy Model for Image Segmentation. In IEEE International Conference on Fuzzy Systems, volume 2, pages 972-977, May 2003.
[48] Y. Ninomiya, S. Matsuda, M. Ohta, Y. Harata, and T. Suzuki. A Real-Time Vision for Intelligent Vehicles. In Intelligent Vehicles '95 Symposium, pages 315-320, September 1995.
