Implementation of Image Processing Algorithms in FPGA Hardware
Embedded Systems (CS 684) Course Project
Ashutosh Dhekne (07305016) Advait Mishra (07305032) Department of Computer Science and Engineering, IIT Bombay
Acknowledgements

We express our sincere thanks to Melissa Fernandez for providing us vital insights into the use of the RC10 board and the software used to program it. Her encouragement pushed us to keep going at times when we would otherwise have stopped. We wish to thank Prof. Kavi Arya for giving us the opportunity to work with the FPGA. His continual interest in the project gave us satisfaction at each milestone.
1 Introduction

Continued exposure to higher-level programming languages such as C and Java has made us forget that we actually execute the programs we write on real hardware. These languages hide from us the various machine-dependent details, including memory management, endianness, and buffering. However, the true nature of computing manifests itself when we do not have this soft cushion underneath. Through our project on the FPGA, we were exposed to 1.2 million tiny gates that could be rearranged to produce new hardware at the click of a button, and, of course, at the cost of a fistful of waiting time. This project report is intended to be a starting point for anyone who wishes to use the RC10 board to build custom hardware. The report is organized as follows. We first present our project setting and describe what we have done with it. Then we present the problems we faced. Finally, we make suggestions to future users of this board.
2 The Image Processing Library Project

We intended to build an image-processing library for the RC10 board. We envisioned that a user sitting at a computer would submit an image along with a stack of instructions to the FPGA. The FPGA would then process the image through the given instructions, applying the image processing algorithms one after another, in a pipelined manner wherever possible. This vision was not completed. However, we did get an image passed to the FPGA, had the FPGA apply a single algorithm to it, and returned the result to the computer. The complete picture we envisioned was nevertheless a necessary impetus for getting a part of it done.

We have created a program that converts BMP images to a simplified format containing only the width, the height, the total size, and the color of each pixel. This raw file is written to a specific location in the FPGA's Flash. The FPGA is configured with the specific algorithm the user wishes to perform on the image. The FPGA runs this algorithm on the raw data and sends the result back, in chunks, to the computer through the USB cable. The program on the computer receives this data, converts it into the BMP format, and saves it to disk. Thus, we have been successful in building an end-to-end solution. We now describe each of these modules in detail.
2.1 The Converters

The BMP header stores vital information about the image. It makes it possible, for example, to support multiple ranges of colors. For a specific application such as ours, however, the generic nature of these headers can well be done without. We created a simpler format that stores only the information relevant to us.
Name | Size (bits) | Description
Width | 32 | The width of the image
Height | 32 | The height of the image
FileSize | 32 | The size of the original BMP image, including the BMP header
Pixel Data (Red) | 8 | The red color value of the pixel
Pixel Data (Green) | 8 | The green color value of the pixel
Pixel Data (Blue) | 8 | The blue color value of the pixel
… | … | …
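As an illustration of how simple this format is to parse on the FPGA side, the sketch below reassembles one 32-bit header field from four consecutive raw-file bytes. It is only a sketch: the byte order is assumed to be little-endian, as in BMP, and NextRawByte is a hypothetical stand-in for whatever per-byte Flash read the design uses (see Section 5.2).

// Reassemble one little-endian 32-bit header field (e.g. Width)
// from four consecutive raw-file bytes. NextRawByte is a
// hypothetical placeholder for the per-byte Flash read.
macro proc ReadHeaderField(result)
{
    unsigned int 8 b0, b1, b2, b3;
    NextRawByte(b0);              // least significant byte first
    NextRawByte(b1);
    NextRawByte(b2);
    NextRawByte(b3);
    result = b3 @ b2 @ b1 @ b0;   // concatenate into one 32-bit value
}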
The BMP format, in contrast, is much more complicated. It is included below for completeness.
Name | Size (bits) | Purpose
Identifier | 16 | The magic number used to identify a BMP file: 0x42 0x4D (ASCII code points for B and M)
FileSize | 32 | The size of the BMP file in bytes
Reserved | 32 | Application specific
Offset | 32 | The starting address where the pixel data begins
HeaderSize | 32 | The size of this header (40 bytes)
Width | 32 | The bitmap width in pixels
Height | 32 | The bitmap height in pixels
ColorPlanes | 16 | The number of color planes being used; must be 1
ColorDepth | 16 | The number of bits per pixel; typical values are 1, 4, 8, 24, 32
CompressMethod | 32 | The compression method being used
ImageSize | 32 | The size of the raw pixel data
HorizontalResolution | 32 | The horizontal resolution of the image, in pixels per meter
VerticalResolution | 32 | The vertical resolution of the image, in pixels per meter
NumberOfColors | 32 | The number of colors in the color palette
ImportantColors | 32 | The number of important colors used
Our converter accepts only images with 24-bit color depth, which fulfils our needs. Since the raw format preserves the original file-size information, and since we assume all other parts of the BMP header to be known or resolvable, we are able to recreate a BMP file from the raw format. We must caution you, however, that we seem to have missed populating some field in the BMP header, due to which a subset of image viewers are unable to display images converted from the raw format back to BMP. We ignored this bug since we could always use other, more tolerant image viewers.
2.2 Writing to the FPGA

We need to be able to write to the FPGA's Flash to copy over the raw image for the FPGA to process. This is done using the RCFlashErase, RCFlashAppendBegin, and RCFlashAppend functions. We also need to configure the FPGA with the required algorithm file, which is done using the RCConfigureBoard function. Both of these operations are, happily, extremely fast. It is only when we read randomly from the Flash that we encounter trouble, as described in Section 5.2.
2.3 Running the Algorithm

Immediately after the FPGA is configured with an algorithm, it starts processing the image stored at the known location. The precise details of the processing depend on the algorithm being executed.

The color-to-grayscale conversion is the simplest of the algorithms. It reads the image from the Flash memory in fixed-size chunks and stores each chunk in an array. It averages each pixel's color values and sends the average value to the computer through the USB module. We must keep in mind that a single microcontroller controls both the USB cable and the Flash memory; we therefore cannot execute any USB-related functions after issuing an RC10FlashReadBegin and before reading all the requested bytes from the Flash. This is why we buffer bytes in an array and then send the whole chunk. We cannot buffer the entire image at once, for two reasons. First, dynamic memory allocation is not possible on an FPGA, so we would have to create a very large static array. Second, a very large array takes up many gates on the FPGA and might exceed the capacity of the device. In any case, compiling the program with a large array takes very long. We return to this point in Section 5.5, where we discuss the problems we faced in implementing our algorithms.

In the convolution algorithms, we read three consecutive rows and store them together in buffers. The convolution mask is progressively applied over consecutive columns of the buffered rows. The result of one such pass is the color values of one row of pixels. These values are handed over to the computer through the USB module, and the next three rows are buffered.

From the computer program's perspective, it simply configures the device with the relevant algorithm and then waits for data to appear on the USB cable. Depending upon the algorithm being run, the amount of data sent at a time by the FPGA changes, but the application program on the computer can remain oblivious to this and naively store the values it receives into a file.
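The skeleton below sketches this chunked read-then-send structure, under stated assumptions: RC10FlashRead and RC10USBWrite are hypothetical stand-ins for the RC10 library's per-byte Flash-read and USB-write calls (we do not vouch for their exact names or signatures), and IMAGE_INDEX names the Flash index holding the raw image. The point of the sketch is the ordering: all 64 Flash reads complete before any USB traffic begins, and RC10MicroRun must run in parallel with all of this (Section 5.2).

#define CHUNK 64

// Read one chunk from the Flash into a buffer, then push it over
// the USB. No USB call is made between RC10FlashReadBegin and the
// last of the CHUNK reads, because one microcontroller services
// both peripherals.
macro proc ProcessChunk(offset)
{
    ram unsigned int 8 buf[CHUNK];
    unsigned int 7 i;

    RC10FlashReadBegin(IMAGE_INDEX, offset, CHUNK);  // hypothetical signature
    i = 0;
    do {
        RC10FlashRead(buf[i <- 6]);   // hypothetical per-byte read
        i++;
    } while (i != CHUNK);

    i = 0;
    do {
        RC10USBWrite(buf[i <- 6]);    // hypothetical per-byte write
        i++;
    } while (i != CHUNK);
}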
3 Results Obtained

We currently support color-to-grayscale conversion, convolution, and histogram computation. The color-to-grayscale conversion and the histogram run without any constraint on the size of the image; the convolution algorithm supports only a limited image width.
3.1 Color to Grayscale

Each pixel of the image contains three bytes, one per color channel. To obtain a grayscale image, we average the three color values of each pixel. The program buffers 64 consecutive pixels, calculates the averages, and sends the new color values of each pixel to the computer through the USB cable. The program does not impose any size limit on the image, but larger images take more time to produce results. Shown below is an example input image and the result of grayscale conversion.
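The report does not spell out the exact averaging arithmetic, so the following is a plausible sketch rather than our verbatim code: the three channel bytes are widened, summed, and the division by three is approximated with shifts, since the / operator is unavailable (Section 5.3). The adju width-adjustment macro comes from the Handel-C standard headers.

#include <stdlib.hch>   // for the adju() width-adjustment macro

// Average the three 8-bit channels of one pixel into a gray value.
// sum/3 is approximated as sum/4 + sum/16 + sum/64 (about 0.328*sum
// instead of 0.333*sum, a slight darkening).
macro proc RGBToGray(r, g, b, gray)
{
    unsigned int 10 sum;
    sum = adju(r, 10) + adju(g, 10) + adju(b, 10);   // widen, then add
    gray = ((sum >> 2) + (sum >> 4) + (sum >> 6)) <- 8;
}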
3.2 Histogram

The histogram program takes a color image as input. For each of the three color channels, in R-G-B order, it counts the number of pixels at each possible color value. For each channel, an array of 256 locations is created, and the number of pixels in the entire image at each color level is recorded in the respective array. This information is pushed onto the USB cable starting from the number of pixels with color value zero, in R-G-B order. The host program must catch this information and display it as it sees fit.
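A minimal sketch of the binning step, with array and macro names of our own choosing. Note how each counter update splits the read and the write into two statements: a ram location cannot be read and written in the same clock cycle (see Section 5.1).

// One 256-bin counter array per color channel.
ram unsigned int 32 histR[256];
ram unsigned int 32 histG[256];
ram unsigned int 32 histB[256];

// Record one pixel. Each ram is read and then written in separate
// statements, since a ram permits only one access per clock cycle.
macro proc CountPixel(r, g, b)
{
    unsigned int 32 t;
    t = histR[r]; histR[r] = t + 1;
    t = histG[g]; histG[g] = t + 1;
    t = histB[b]; histB[b] = t + 1;
}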
3.3 Sobel Edge Detection

We applied two Sobel filters, one after the other, to perform edge detection in both directions. For this, we modified our convolution algorithm to do the same computation twice. The two filter responses should be merged by taking the square root of the sum of their squares; however, since we did not have a sqrt function, we approximated the merge with a simple addition. This introduces some errors in the results we obtained.
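In other words, where the textbook Sobel magnitude is sqrt(gx² + gy²), we used |gx| + |gy|. Below is a sketch of such a merging step, with names of our own choosing and an added clamp to one byte; the caller is assumed to supply the two responses as signed 11-bit values (wide enough for 8-bit pixels).

// Merge the two Sobel responses. sqrt(gx*gx + gy*gy) is
// approximated by |gx| + |gy|, clamped to 255.
macro proc SobelMagnitude(gx, gy, mag)
{
    unsigned int 11 ax, ay, sum;
    ax = (unsigned int 11)((gx < 0) ? -gx : gx);   // |gx|
    ay = (unsigned int 11)((gy < 0) ? -gy : gy);   // |gy|
    sum = ax + ay;
    mag = (sum > 255) ? 255 : (sum <- 8);          // clamp to one byte
}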
3.4 Convolution

The convolution program reads the convolution mask from the Flash, along with the amount of shift required to approximate the division operation. Originally, we had put independent statements inside par blocks, but we saw drastic improvements in compile time when the par was removed.

The mask is applied over the image starting from the first row. Remember that since BMP images are stored bottom-up, the first row is the bottommost row. The program buffers three consecutive rows and then applies the convolution mask to successive columns from left to right. The result of each convolution is stored at the position of the left pixel of the center row; due to this, the output image is shifted left by one pixel. The mask is not convolved with the last one or two columns, depending upon whether the width of the image is a multiple of three (the mask size). After the result of one row is available, it is returned to the computer over the USB.

An inherent inefficiency is present in the program: we buffer three rows at a time, and the next time around we buffer two of those rows again. This improvement is left for future work. The results for two different masks are shown below: the first image is the original, the second is Gaussian smoothened, and the third is smoothened by an all-ones mask. The third image is lighter because we approximated the division by nine with a shift by three (a division by eight).
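A sketch of one mask application, under stated assumptions: a single color channel, an assumed maximum width of 512 pixels, and, for brevity, the mask shown as a compile-time rom even though the real program loads it from the Flash. The final right shift is the division approximation discussed above; with the all-ones mask and maskShift = 3 the sum is divided by 8 rather than 9, which is exactly why that output comes out lighter.

#include <stdlib.hch>   // for adju()

#define MAXW 512        // assumed maximum image width

ram unsigned int 8 row[3][MAXW];   // three buffered rows, one channel
rom unsigned int 8 mask[3][3] = {{1,1,1},{1,1,1},{1,1,1}};

// Convolve the 3x3 mask at column c (a 9-bit index, 0..width-3).
// The shift replaces division by the sum of the mask weights; the
// shift amount is assumed to normalize the result into one byte.
macro proc Convolve3x3(c, maskShift, result)
{
    unsigned int 20 acc;
    unsigned int 2 i, j;
    acc = 0;
    i = 0;
    do {
        j = 0;
        do {
            acc = acc + adju(mask[i][j], 20) * adju(row[i][c + adju(j, 9)], 20);
            j++;
        } while (j != 3);
        i++;
    } while (i != 3);
    result = (acc >> maskShift) <- 8;
}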
4 Shortcomings

Initially, we had hardcoded the convolution mask inside the FPGA code. We realized that it is better for the mask to be read from a fixed location on the Flash than to be hardcoded into the program. This way, only a single bit file needs to be written to the FPGA for any convolution algorithm; the mask can then be changed whenever required by writing a new mask file to the Flash.

Handel-C does not provide the square root operator as a function. Therefore, in the edge detection algorithm, we had to approximate it using addition, which causes some errors in the output image.

Fixing a buffer size is one of the most difficult tasks: we face a tradeoff between compilation time, execution time, and the number of gates required. A quick check showed that color-to-gray conversion with a buffer as large as 4096 bytes drastically improved the time required to get the data onto the computer compared to a 64-byte buffer (see also Section 5.5). However, it also increased the compilation time to an unacceptable degree. We must appreciate that during the development phase compilation time is very important, since it dictates how much work gets done in a given amount of time. Since most student projects spend much of their time in this phase, we used smaller buffers in many places so as to spend less time compiling our code.
5 Our Experiences

In this section, we highlight our observations and the difficulties we faced during this project. The principal reason for including this section is not so much to lament our inability to complete everything we promised, but to provide a starting point for those who may wish to work on the RC10 board in the future. When we started working on the board, only minimal support was available. We hope to improve this situation.
5.1 No Static RAM

The RC10 board does not have static RAM. Any memory required in the form of variables in a Handel-C program is thus mapped to actual hardware blocks. We may declare that a variable should be kept in RAM by using the ram keyword before its declaration:

ram unsigned int 32 my_variable_name;
We must always keep in mind, however, that we cannot access multiple variables declared with the ram keyword in a par block. This behavior is expected, since only one word of memory can be accessed through a ram at a time. There are also restrictions on the size of a variable declared as ram; we cannot, for example, have a 13-bit-wide variable stored in ram. This again follows from the way the ram is constructed. Since the RC10 board does not possess static RAM, we could not use any variables declared with the ram keyword. However, Handel-C does not complain even if we use such variables, and this is a major source of error. A minimal sketch of the access constraint follows this list.
• Do not use multiple variables declared as ram in one par block.
• Check the RAM support for your board in the manual.
• The RAM does not allow arbitrary bit widths.
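The sketch below illustrates the constraint with a single ram-declared buffer; the commented-out par block is the kind of code that builds but misbehaves (in our experience the compiler does not always catch it).

ram unsigned int 8 buf[16];
unsigned int 8 x, y;

// Illegal: two accesses to the same ram scheduled in one clock cycle.
// par { x = buf[0]; y = buf[1]; }

// Safe: one ram access per clock cycle.
seq { x = buf[0]; y = buf[1]; }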
5.2 Flash Operations

Unlike the random-access memories we are used to in the software world, Flash memories take different amounts of time to write contiguous bytes than to write bytes in random order. Flash memories are extremely fast at reading or writing one entire line, but require far more time (10 to 100 times more) to write bytes to different lines. The RC10 board supports two functions for performing Flash writes:

RC10FlashAppendBegin(Index, Length)
RC10FlashAppend(Byte)
The first function initializes writing to the Flash at the location specified by Index; the RC10 board has 256 indexes. The second function actually appends a byte to the Flash memory at that index. We must not call any other Flash or USB read or write function after issuing RC10FlashAppendBegin until we have called RC10FlashAppend the promised Length times. Think of these as thread-unsafe functions sharing static variables. Remember also that these functions always append to the existing data at that Index location; you cannot overwrite a single byte. To erase an entire Index location and automatically reset its size to zero, we must use the following function:

RC10FlashErase(Index)
Another very important point to remember when using these functions is that we must call the RC10MicroRun function in parallel with any of the Flash read/write functions. The FPGA has no mechanism to complain if you forget to run this function; it simply will not do the intended task. The parameter passed to this function is the current hardware operating clock speed.

RC10MicroRun(ClockSpeed)
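A sketch of the required structure follows. RC10MicroRun runs for the life of the design, so the Flash work lives in the other branch of the par; ClockRate stands for whatever clock-speed constant the design uses.

// The microcontroller service routine must run alongside any Flash
// activity; without it the transfers silently do nothing.
par
{
    RC10MicroRun(ClockRate);          // service the microcontroller

    seq                               // the actual Flash work
    {
        RC10FlashErase(140);
        RC10FlashAppendBegin(140, 2);
        RC10FlashAppend(0x12);
        RC10FlashAppend(0x34);
    }
}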
Issuing RC10FlashAppendBegin repeatedly to write small chunks of data is not a good idea; we saw a lot of unpredictable behavior from the Flash. The following code snippet does not work correctly.

unsigned int 16 i;
i = 0;
do {
    RC10FlashAppendBegin(140, 1);   // a new one-byte append each iteration
    RC10FlashAppend(i <- 8);        // write the low byte of i
    i++;
} while (i < 1000);
Instead, if RC10FlashAppendBegin is called only once, before the loop, with a request to write all 1000 bytes, the code works correctly.

unsigned int 16 i;
i = 0;
RC10FlashAppendBegin(140, 1000);    // promise all 1000 bytes up front
do {
    RC10FlashAppend(i <- 8);        // write the low byte of i
    i++;
} while (i < 1000);
Similar to the constraint on the RAM blocks, we may not run multiple Flash read/write functions in a par block: signal assertion for multiple reads/writes cannot be done in parallel.
• Writing to a new line is expensive in time.
• Multiple Flash reads/writes cannot run in parallel.
• Forgetting to issue as many read/write calls as promised in a …Begin statement may corrupt the Flash; a Flash format will then be required.
• Call the RC10MicroRun function in parallel with any Flash read/write function calls.
5.3 Mathematical Operators

Handel-C directly supports the addition, subtraction, and multiplication operators, as well as the various bitwise operators. However, division is not supported through the / operator; it is supported only in the form of a div function. This is to discourage use of computationally expensive division. We may approximate division with bit-shift operations, which map onto the hardware much more efficiently than a generic division circuit; the tradeoff here is gate count versus accuracy. A recommended way to reduce calculation is to use lookup tables burned into the FPGA. It is common to have an entire log table in hardware if the application calls for it. In real systems, such lookup tables are built into ROM chips and can be accessed almost at the CPU's clock speed, or at least at the speed of the RAM. A sketch of both techniques follows this list.
• Use simple bitwise operators to approximate complex mathematical functions.
• Build lookup tables when possible.
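Two minimal sketches of these techniques. The shift pair approximates division by nine (1/8 − 1/64 ≈ 0.109 against the true 0.111), noticeably closer than the single shift-by-three we used in Section 3.4; the rom is a toy lookup table of squares, traded for a multiplier.

// Approximate x/9 with two shifts: x/8 - x/64.
macro expr DivideBy9(x) = (x >> 3) - (x >> 6);

// A lookup table burned into the FPGA: squares of a 4-bit index,
// read in one cycle instead of paying for a multiplier.
rom unsigned int 8 square[16] = {0, 1, 4, 9, 16, 25, 36, 49,
                                 64, 81, 100, 121, 144, 169, 196, 225};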
5.4 Expression Simplification

A big expression consumes a lot of gates and slows down the entire system. It is advisable to break bloated expressions into smaller, preferably independent, pieces. Computing subexpressions in parallel can usually boost the performance of the entire system. There is a tradeoff, however: every statement inside a par block requires its own circuitry, since statements running in parallel cannot share common hardware, and this increases the gate count. A recommended design strategy is to aggressively build in as much parallelism as possible and fall back to sequential execution if the number of gates proves insufficient. We observed that compile time reduces drastically when par blocks are not used; you may want to consider removing par blocks during the debug phase for quicker compilations. When par blocks are not used, variables may be declared as ram to further reduce compilation time. A sketch of splitting an expression is shown below.
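A small sketch of the idea, with illustrative 8-bit widths. The four sums are independent, so they can be formed together inside a par; the original single-line expression would instead build one long combinational chain.

unsigned int 8 a, b, c, d, e, f, g, h;
unsigned int 8 s0, s1, s2, s3, p0, p1, y;

// Bloated form: one long combinational chain in a single cycle.
// y = ((a + b) * (c + d)) + ((e + f) * (g + h));

// Split form: independent subexpressions computed in parallel.
par
{
    s0 = a + b;
    s1 = c + d;
    s2 = e + f;
    s3 = g + h;
}
par
{
    p0 = s0 * s1;
    p1 = s2 * s3;
}
y = p0 + p1;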
5.5 Arrays

An FPGA, in our opinion, is not very efficient at creating memory blocks. It should fundamentally be used for parallel computation, with any buffering or storage done by specialized devices such as RAM, ROM, and Flash memories. The RC10 lacks a separate RAM module, however, so storage has to be created in the FPGA itself. We observed that creating large arrays accessed from multiple parts of the program drastically increases the time required to compile the bit file. We were naively astonished to see that the bit file was always the same size no matter how much time it took to build! This is expected, since the bit file only contains the configuration and interconnection of the various gates in the system: an entire description is always required, and when we use fewer gates, much of this description simply says to do nothing.

We tried to compare multiple single-dimensional arrays against a single multidimensional array. In both cases, however, compilation took so long, to the tune of several tens of minutes, that we lost count. We still have an intuition that multidimensional arrays may require more intelligence on the part of the compiler than multiple single-dimensional arrays.

The reason we used buffers is to increase throughput. As noted in Section 5.2, the same microcontroller controls both the USB and the Flash memory. It is inefficient to read single bytes from the Flash, issuing a new RC10FlashReadBegin for each byte. On the other hand, we cannot push bytes out on the USB before we have read all the bytes requested in the call to RC10FlashReadBegin, lest we corrupt the Flash. Hence, we have to buffer the processed bytes before we can send them over the USB, and we require arrays to accomplish this. In the color-to-grayscale conversion program, switching from a 64-byte buffer to a 4096-byte buffer dropped the execution time from about a couple of minutes to about a couple of seconds.

We also suggest using the ram keyword when an array is used only for buffering. Remember that multiple locations of such an array cannot be accessed in parallel, but you would usually not require parallel access to an array used purely as a buffer, whose purpose is intermediate storage between retrieving data from the Flash and sending it to the USB cable, or vice versa. A sketch contrasting the two array layouts we tried follows this list.
• Building arrays costs a great deal of compile time and gates.
• Use ram-based arrays for buffering.
• Multiple single-dimensional arrays may require less compile time than a single multidimensional array.
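For what it is worth, the two layouts we compared look like this (buffer sizes are illustrative); we suspect the first compiles faster, but as noted we never saw either compilation through to the end.

// Three separate single-dimensional row buffers...
ram unsigned int 8 row0[512];
ram unsigned int 8 row1[512];
ram unsigned int 8 row2[512];

// ...versus one two-dimensional array holding the same data.
// ram unsigned int 8 rows[3][512];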
5.6 VGA Output from the RC10 Board

Our initial plan was to display the processed images directly on the screen from the FPGA. However, the RC10 does not have library routines that simplify displaying images on a screen. Though we could get the VGA output to display some color patterns, displaying an image involves synchronization issues. Without the support of relevant libraries, and given the lack of static RAM, it was not trivial to show images on the screen. We leave this as future work.
6 Conclusion

This project gave us an entirely new perspective on writing code. We gained novel experience in handling hardware that is extremely powerful and about as tunable as software. We reached the stage of a respectable working system after crossing numerous pitfalls. Many student ideas can be effectively kindled and matured by prototyping them on an FPGA. We expect our project to provide the starting points required for more sophisticated ventures; future teams may begin at the points we left dangling. The FPGA presents many challenges; still, overcoming them is satisfying and enjoyable.