IEICE TRANS. FUNDAMENTALS, VOL.E87–A, NO.3 MARCH 2004


PAPER

Special Issue on Applications and Implementations of Digital Signal Processing

Hardware object model and its application to the image processing

Kenji KUDO†a), Yoshihiro MYOKAN†b), Winh CHAN THAN†c), Shinji AKIMOTO†d), Nonmembers, Takashi KANAMARU†e), and Masatoshi SEKINE†f), Members

SUMMARY To realize a hardware object that facilitates application development in reconfigurable computing systems, a hardware module (HwModule) is proposed and implemented. To access the circuits in the HwModule from a standard PC without detailed knowledge of the hardware, an object manager (ObjectManager) is also implemented. With the help of the ObjectManager, programmers can use hardware objects like ordinary software objects. The HwModule is applied to image matching, and the ease of application development for the HwModule is confirmed.

key words: reconfigurable computing system, hardware object model, wavelet transformation, template matching

1. Introduction

In the application-specific domain, the field programmable gate array (FPGA) has been widely used as a replacement for ASICs because it is suitable for rapid prototyping and shortens the time to market. Because FPGAs are reconfigurable, studies on reconfigurable computing systems have been attracting considerable attention[1]–[7]. The structure and the basic functions of the usual von Neumann computer are fixed, and its flexibility is brought about by the programs stored in memory. In reconfigurable computing systems, on the other hand, the structure of the system itself can be flexibly altered by reconfiguring its programmable devices. However, to gain popularity, reconfigurable computing systems must overcome some problems. The main problem is the difficulty of developing applications for them, because detailed knowledge of the circuit is required. Despite these problems, reconfigurable computing systems are used to distribute the load of tasks, and they can be reconfigured to process other tasks. With such a method, SPLASH2[6] performed pattern matching for genetic databases, fingerprint matching, and high-speed image matching, and drastic speedups for those tasks were realized.

On the other hand, to facilitate application development for reconfigurable computing systems, object-oriented methods that deal with hardware as an object have been proposed by several authors[7]–[10]. In Refs. [7]–[9], methods for designing circuits from the gate level in standard programming languages such as C++[7] and Java[8], [9] are proposed, and it is suggested that software programmers can design hardware without learning any hardware description language. These studies are concerned with partial reconfiguration in an FPGA; thus their hardware objects mainly encapsulate circuits with a relatively small number of gates, and their models cannot benefit from the usual EDA tools. Davis et al.[10], in contrast, proposed hardware objects which are designed by hardware engineers with standard hardware description languages such as Verilog-HDL and VHDL. Software programmers purchase or create hardware objects suitable for their purpose, such as the discrete cosine transformation, and can use them like ordinary software objects in their applications. In this method, the details of the hardware are hidden from the software programmer by an abstraction layer, so the programmer can spend time and energy on developing the application. Although their hardware objects are attractive, they present only the idea, and their hardware objects are not implemented.

Motivated by the hardware object of Davis et al., we implement a platform named the hardware module (HwModule) to execute hardware objects (HwObjects)[11]. The HwModule is attached to the PCI bus of a standard Windows PC and has three FPGAs for applications. Moreover, an object manager (ObjectManager) for accessing the HwModule from the host computer is also implemented. With the ObjectManager, a circuit (HwNet) in the HwModule is easily accessed from a standard C++ compiler and can be used as an object in the application.

Manuscript received June 27, 2003.
† Department of Electrical and Electronic Engineering, Faculty of Technology, Tokyo University of Agriculture and Technology, Tokyo 184-8588, Japan

We apply our HwModule and HwObjects to multi-level matching for detecting faces, as a feasibility study to find out whether they will


work in digital signal processing. In general, in digital signal processing, stream data are processed in pipelined operations, but the tasks often differ depending on the type of data, e.g., sound and image, and on the purpose, e.g., data compression and recognition. Thus, those various tasks are usually realized in software and executed on DSPs. However, when a large quantity of data must be processed, the load on the DSP becomes enormous, and its power consumption is no longer negligible. In such a situation, parallel signal processing might be effective, but it conflicts with the sequential execution of software. Our HwObjects encapsulate the parallel part of the signal processing, such as pipelined operations, and they can be embedded in software.

As for data transfer in digital signal processing, the usual personal computer system has only one standard PCI bus, and its data transfer speed is not very fast. Even if new specifications such as PCI-X or PCI Express are realized, they might not be sufficient for the enormous quantities of data in digital signal processing. Thus, pre-processing and compression of the raw data are required before the data are transferred to the host processor. In such a situation, the application in the host computer and the pre-processing module must cooperate with each other, so a simple separation of software and hardware might be difficult. Our HwObjects can be used like ordinary software objects, so the cooperation of software and hardware is easily realized. Moreover, when other tasks are required, they can be dynamically replaced by other HwObjects. In short, the hardware, i.e., the circuits, is embedded as active objects in the software, i.e., the programs.

The present paper is organized as follows. In Sect. 2, the structure of the HwModule is briefly summarized. In Sect. 3, three examples of the HwNet, namely the control circuit of a camera, the wavelet transformation circuit, and the template matching circuit, are presented. In Sect. 4, the implementation of the HwObject is explained. With the help of the ObjectManager, the HwObject can be used like an ordinary object from the application without detailed knowledge of the hardware. In Sect. 5, our HwModule and HwObjects are applied to image matching. The conclusions and discussions are given in the final section.

2. Structure of HwModule

As shown in Fig. 1, our hardware module (HwModule) is a memory device attached to the PCI bus of a host computer. It is composed of one FPGA (Xilinx XC2S200) serving as the bridge between the PCI bus and the local bus (LB) with a HwNet controller (PCI/LB interface), three local 1 MB memories (LM), three FPGAs (Xilinx XC2S200) for virtual circuits (HwNets), a 32-bit microprocessor, and general purpose interface ports (GPIF). In principle, multiple HwModules can be attached to the host computer, but in the present manuscript we consider only a single HwModule.

Fig. 1 A block diagram of HwModule.

The HwModule is integrated as a memory-like device so that it can be easily accessed from the functions of applications. The hardware object (HwObject) is implemented on the FPGAs in the HwModule as HwNets, while its data, functions, and the interface from the host are implemented in main memory (MM). The HwNets in FPGAs which are not active can be reconfigured even while other FPGAs are active. The microprocessor on the HwModule performs the communication with the host computer, the configuration of the FPGAs, and the allocation and deallocation of LMs for HwObjects. It takes 811 ms to configure an FPGA with the software of the microprocessor. On the other hand, it is confirmed that the configuration time of an FPGA is shortened to 50 ms when a trial circuit implemented on FPGA0, which uses JTAG[12] to download configuration bitstreams to the FPGAs, is used for the configuration. Thus we will use this circuit for the configuration of FPGAs in the future. The memory for the microprocessor (CM) stores the program for the microprocessor and the temporary bitstream data of HwNets.

Two FPGAs share one LM, and a single FPGA can access the two LMs on both of its sides. Thus no data transfer is required when two FPGAs cooperate, and the HwModule can perform pseudo-pipelined operations. On the other hand, all three LMs can be accessed from each FPGA and from the host computer as a single memory. Such behavior is realized by the peripheral circuit in the FPGAs controlled by the PCI/LB interface, as shown in Fig. 2. Two LMs are connected with each other by the peripheral circuit when requested by the PCI/LB interface. Moreover, the HwNets must send a request to the PCI/LB interface to get a grant for access to the

KUDO et al.: HARDWARE OBJECT MODEL

local bus. A picture of the HwModule is shown in Fig. 3, and the terminologies used in this paper are summarized in Table 1. They will be explained in the following sections.

Fig. 2 A peripheral circuit and a HwNet in the FPGA1.

Fig. 3 A picture of HwModule.

Fig. 4 A camera module.

3. HwNet libraries

To implement applications on our HwModule, we developed several HwNets. They are supplied as libraries together with their drivers and compose the HwNet libraries, so programmers can use them as HwObjects.

3.1 Control circuit of camera

We designed a control circuit for a camera as a HwNet. It controls a camera module which is built around a VGA CMOS camera (image sensor) OV7620 as an input device. As shown in Fig. 4, the camera module is connected to the HwModule through the GPIF. The internal blocks of the control circuit are shown in Fig. 5. The LocalBusController block controls input and output for the LMs. It reads the request command from the host and writes the image data decoded from the camera signal. The I2cMasterController block generates the i2c† signal to set the registers of the camera; for this block we use a core published by OpenCores[13]. The ImageSignalDecoder block decodes the signal from the camera and outputs it as image data. The ControlCameraMain block, which coordinates the control of the whole circuit, sets the register values and writes the image data from the camera to the LM. This circuit can capture images from the camera in real time (monochrome 640×480, 15 frames per second).

Fig. 5 The image capturing system for the HwModule. PCLK, HREF, VSYNC, and Y denote the basic clock of camera, the horizontal window reference signal, the vertical synchronize signal, and the image signal, respectively. SDA and SCL denote the serial data and the serial clock for i2c bus, respectively.

† Short for Inter Integrated Circuit. A type of bus designed by Philips Semiconductors in the early 1980s to provide an easy way to connect a CPU to peripheral chips. It uses a simple bi-directional 2-wire, serial data (SDA) and serial clock (SCL) bus for inter-IC control.

3.2 Wavelet transformation circuit

A HwNet which applies the Haar wavelet transformation[14] to a two-dimensional image is designed. As shown in Fig. 6, when the Haar wavelet transformation is applied to an image, four images are obtained, which represent the low-frequency components, the high-frequency components for the vertical direction, the high-frequency components for the horizontal direction, and the high-frequency components for the diagonal direction. Those components are defined as

$$C^{n+1}_{k,l} = \frac{1}{4}\left(C^{n}_{2k,2l} + C^{n}_{2k+1,2l} + C^{n}_{2k,2l+1} + C^{n}_{2k+1,2l+1}\right), \tag{1}$$

$$D^{n+1}_{k,l} = \frac{1}{4}\left(C^{n}_{2k,2l} + C^{n}_{2k+1,2l} - C^{n}_{2k,2l+1} - C^{n}_{2k+1,2l+1}\right), \tag{2}$$

$$E^{n+1}_{k,l} = \frac{1}{4}\left(C^{n}_{2k,2l} - C^{n}_{2k+1,2l} + C^{n}_{2k,2l+1} - C^{n}_{2k+1,2l+1}\right), \tag{3}$$

$$F^{n+1}_{k,l} = \frac{1}{4}\left(C^{n}_{2k,2l} - C^{n}_{2k+1,2l} - C^{n}_{2k,2l+1} + C^{n}_{2k+1,2l+1}\right), \tag{4}$$

where $n = 0$ and $C^{0}_{k,l}$ denotes the pixel value of the original image at the point $(k, l)$; we call this the transformation of level one. The transformations of level two, three, and so on are obtained by applying the transformation repeatedly to the low-frequency components, as shown in Fig. 7. The HwNet of the Haar wavelet transforms the image stored in the first LM and stores the transformed image in the second LM. To reduce the transformation time, 2-level and 3-level transformation circuits are also implemented.

Fig. 6 The Haar wavelet transformation of Lenna.

Fig. 7 A sequence of the wavelet transformations.

Table 1 The terminologies used in this paper.

HwModule          FPGA board attached to the PCI bus
HwObject          the object realized with hardware but used like software
SwObject          the usual object programmed in a software language
HwNet             the circuit in the FPGA and the substance of the HwObject
HwModuleDriver    a device driver for the HwModule which is attached to the OS when it starts up
HwManager         an object which manages the HwModule and is implemented as an API
HwNetDriver       a driver for the HwNet which is attached to the HwManager
ObjectManager     an object which manages HwObjects, SwObjects, and the HwManager for their cooperation

3.3 Template matching circuit

We designed a circuit for template matching as a HwNet. It detects the position where the template image matches the input image, and outputs the position and a value of matching. For the template image $A(x, y)$ $(0 \le x, y \le 15)$ and the input image $B(x, y)$ $(0 \le x, y \le 46)$, this circuit outputs the position $(x, y)$ in the input image where

$$M(x, y) = \sum_{i=0,\,j=0}^{15,\,15} \left|A(i, j) - B(i + x, j + y)\right| \tag{5}$$

takes the smallest value, and outputs the value of matching $M_{\min}$. As shown in Fig. 8, we designed a circuit using the systolic array architecture, which is suitable for the HwNet, to execute high-speed matching[15]. This circuit group (PE32) consists of 32 PEs (processing elements), each of which calculates the sum of the absolute values of the differences between two inputs $a_{in}$ and $b_{in}$. The HwNet of template matching can be created by arranging several PE32s, and the number of PE32s can be determined in consideration of the capacity of the FPGA. In this paper, we created the template matching circuit with two PE32s, which can perform template matching with a 640×480 input image and a 16×16 template image. As shown in Fig. 9, in the HwNet, the MatchingMain block reads control commands from the LM through the LocalBusController block, and loads an input image and a template image into the BlockRAMs† successively according to the commands. After the execution of template matching by the PEs, the result is stored in the LM. The MatchingObject is called by the SwObject on the host, and then passes the parameters to the MatchingHwNet through the LM after converting them into a format that the HwNet can execute.

† Small RAM blocks in the FPGA. The Xilinx XC2S200, the FPGA on our HwModule, has 14 RAM blocks of 4096 bits each. Any circuit of the HwNet can read or write 1 to 16 bits of data at any address in one clock cycle.

Fig. 8 A circuit for template matching (PE32).

Fig. 9 The structure of the HwNet for template matching and a schematic diagram for the call of the HwNet for the template matching from the HwObject.

3.4 The performance of each HwNet

The gate counts, clock frequencies, and processing times of the HwNets are shown in Table 2. The processing time is the time to process an 8-bit gray-scale VGA (640×480) image. It is found that the three HwNets have sufficient performance for real-time target tracing.

Table 2 The performance of each HwNet.

HwNets               gates      frequency   processing time
control of camera    16,341     82 MHz      30 frames/s
wavelet trans.       48,868     63 MHz      357 frames/s
template matching    244,921    40 MHz      26 frames/s

3.5 HwNetDrivers

The HwNetDrivers and the HwNets compose the HwNet library. When the application constructs the HwObjects, the HwNetDriver is temporarily linked by the HwManager. The HwNetDriver must have the following essential functions:
1) Pre- and post-process functions, which let the HwManager initialize the HwNetDriver at its loading phase and finalize it at its unloading phase.
2) Data access functions, which request access to the LM according to the data format specified for the HwNet.
3) Control functions, which receive requests from the HwObject and then send commands to the HwNet or read the status of the HwNet based on the received requests.

To control the HwNet with the above functions, the HwObject has four operations, namely creation, initialization, execution, and deletion, and they are performed by the HwManager with the HwNetDriver. Since each HwNet has its own functions and data format, all the specific information is included in the individual HwNetDriver.

Fig. 10 The layers of the components for supporting the HwObject operations.

4. Implementation of HwObject

4.1 Outline of HwObject

Since the HwNet of a HwObject is loaded on the FPGAs in the HwModule, the interface of the HwObject must be loaded into the main memory, because the application in the main memory accesses the HwNets via the PCI bus. As shown in Fig. 10, the application calls the HwObject like a software object (SwObject). The HwObject requests the ObjectManager to create a task that connects to the target HwNet. The HwManager performs the task by using the HwNetDriver and sends a command to the HwNet in the HwModule. The command is passed to the HwModuleDriver, which sends the physical signals to the HwModule and the HwNet via the PCI bus. With the above steps, software programmers can use the HwNet circuits without detailed knowledge of the hardware.

Application design: The public functions of the ObjectManager, the HwObject, and the SwObject are visible at the user application layer. From the viewpoint of application development, it is an advantage that conventional development tools for object-oriented applications can be used. All the objects for the application are derived from the base classes of the HwObject and the SwObject, but multiple inheritance from both of them is prohibited.

Free from platform: The primary aim is to make the tool independent of OSes, compilers, and hardware platforms. All the components, i.e., the ObjectManager, the HwManager, the HwObject, and the SwObject, are built in library form and are uniformly derived from one base class, so that portability, testability, and maintenance are kept as simple as possible. The object-oriented approach increases the overhead, but from our experience we regard it as negligible.

Two Device Drivers: We use two device drivers to separate the HwNet from the hardware environment. The HwModuleDriver is a device driver for the HwModule, and it is attached to the OS when the operating system starts up. The HwModuleDriver absorbs the differences among the various HwModule boards, and provides common access functions such as hardware control operations by the OS, interrupts, and data transmission. The HwManager sets them up by reading the HwModuleDriver. The HwNetDriver is attached to the HwManager after the application is executed. Note that the HwNetDriver is mounted temporarily on the HwManager when the HwNet is required, so the OS does not recognize the HwNet as hardware. The HwNetDriver contains the commands, status, and data formats for the HwNet, and provides common access functions to the HwObject and the HwManager.

Reusability: A standard device, the HwModule, is attached to a conventional PC, and conventional FPGA devices are integrated on the HwModule. On top of them, we specify the virtual and temporary circuits, i.e., the HwNet and the HwObject. This design style is suitable for providing a HwObject library in the same way as a software object library. Though the common HwNets may be less efficient than special-purpose circuits, we think that new design methods will be invented, such as dynamic HwNet combination, reconfigurable mechanisms, evolutionary circuits, and so on.
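The layering described in this subsection can be illustrated with a minimal C++ sketch. It is not the HwObject library itself: the class names ObjectBase and LoggingDriver, all member function names, and the integer command and status codes are assumptions introduced only for illustration, and the driver here records calls instead of touching any hardware. The sketch only shows the idea that a HwObject and a SwObject share one base class, and that a HwObject delegates its work to a HwNetDriver offering pre- and post-process, data access, and control functions.

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// All names below are hypothetical; the paper does not list its class definitions.

// Single common root, following the statement that all components
// are derived uniformly from one base class.
class ObjectBase {
public:
    virtual ~ObjectBase() = default;
    virtual std::string Kind() const = 0;
};

// Stand-in for a HwNetDriver: it hides HwNet-specific commands and data formats.
class HwNetDriver {
public:
    virtual ~HwNetDriver() = default;
    virtual void PreProcess() = 0;    // called when the driver is linked
    virtual void PostProcess() = 0;   // called when the driver is released
    virtual void WriteData(const std::vector<std::uint8_t>& data) = 0;  // data access
    virtual int Control(int command) = 0;  // send a command, return a status code
};

// Fake driver that logs every call, so the sketch runs without a HwModule.
class LoggingDriver : public HwNetDriver {
public:
    void PreProcess() override { log.push_back("pre"); }
    void PostProcess() override { log.push_back("post"); }
    void WriteData(const std::vector<std::uint8_t>& data) override {
        log.push_back("write:" + std::to_string(data.size()));
    }
    int Control(int command) override {
        log.push_back("cmd:" + std::to_string(command));
        return 0;  // 0 means success in this sketch
    }
    std::vector<std::string> log;
};

// Ordinary software object.
class SwObject : public ObjectBase {
public:
    std::string Kind() const override { return "SwObject"; }
};

// Host-side interface of a hardware object; the real work is delegated to the driver.
class HwObject : public ObjectBase {
public:
    explicit HwObject(std::shared_ptr<HwNetDriver> driver)
        : driver_(std::move(driver)) {
        driver_->PreProcess();
    }
    ~HwObject() override { driver_->PostProcess(); }
    std::string Kind() const override { return "HwObject"; }

    // Write the input data, then issue the command and return the driver status.
    int Execute(int command, const std::vector<std::uint8_t>& input) {
        driver_->WriteData(input);
        return driver_->Control(command);
    }

private:
    std::shared_ptr<HwNetDriver> driver_;
};
```

In this sketch an application would construct a HwObject with some driver and call Execute exactly as it would call a member function of a SwObject; swapping LoggingDriver for a driver bound to a real HwNet would not change the calling code, which is the separation the two-driver design aims at.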
4.2 Detailed description of the ObjectManager

The ObjectManager holds the cycle time of the application, the event queue of the objects, and the management table for the objects. The execution of all the functions can be controlled by both the OS and the ObjectManager in the application. It performs
1) the execution of the requested object;
2) the control of the cycle for synchronization among objects;
3) the construction and the destruction of the objects;
4) the management of the HwManager and the registration of the HwObjects;
5) the management of the dependencies of the objects.

Implementation: We adopted an event-driven, non-preemptive multitasking OS because it is relatively simple, fast, easy to implement, and independent of the platform. A non-preemptive multitasking OS has the shortcoming that it cannot force a switch of control, but we think this shortcoming does not lower the performance, because the HwObjects are implemented as circuits and their execution is fast, and the ObjectManager is attached directly to the preemptive OS.

Fig. 11 A schematic diagram of the function of the ObjectManager.

    void ViewObject::Operate( uint nSeq ){
      switch( nSeq ){
        case 0:   // request a capture; re-enter with nSeq=1 on completion
          camObj->ReqCapture();
          camObj->PostSignal( mSigP1( Operate, 1 ) );
          break;
        case 1:   // request the wavelet transform; re-enter with nSeq=2
          wtObj->ReqTransform();
          wtObj->PostSignal( mSigP1( Operate, 2 ) );
          break;
        case 2:   // finish this frame and request the next capture
          Complete();
          camObj->ReqCapture();
          camObj->PostSignal( mSigP1( Operate, 1 ) );
          break;
        default: ;
      }
    }

Fig. 12 The interface command example. The ViewObject controls CamObject (camObj) at nSeq=0 or 2, and WtObject (wtObj) at nSeq=1.

Task and event: The HwManager, the HwObjects, and the SwObjects are derived from the common base class and managed by the ObjectManager. A task is defined as a process of the HwNet. There are three main kinds of events, namely the starting of tasks, the asynchronous request of process completion from the HwNet, and the periodic acquisition of the HwNet status. Tasks are started by events, and a completion event is issued when a task is completed. The cycle is stored in each event, as shown in Fig. 11. Note that the cycle also denotes the priority of the task. When there are multiple asynchronous requests from the hardware in one cycle, a request ID is assigned to each event so that the order of the asynchronous requests is preserved.

Cycle and priority queue: The ObjectManager implements a priority queue that holds both the events that occur asynchronously and the events for periodic access to the HwNets via the HwModule. Both kinds of events are scheduled independently on the priority queue, as shown in Fig. 11. The ObjectManager and the base class of each object refer to the identical cycle, so the whole system can work consistently.
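The cycle-ordered event queue described above can be sketched as follows. This is a minimal illustration, not the ObjectManager's code: the Event fields, the EventQueue class, and the DrainOrder helper are hypothetical names, and the sketch assumes that a smaller cycle number means an earlier event with higher priority and that the request ID is a simple counter that breaks ties within one cycle. It uses the standard std::priority_queue with a custom comparator.

```cpp
#include <cstdint>
#include <queue>
#include <string>
#include <vector>

// Hypothetical event record; the field names are assumptions.
struct Event {
    std::string name;         // e.g. "EvHwModule"
    std::uint32_t cycle;      // cycle in which the event was issued (assumed: smaller = served first)
    std::uint32_t requestId;  // keeps the issue order of events within one cycle
};

// Comparator for std::priority_queue: returns true when a must be served after b,
// so the event with the smallest (cycle, requestId) pair ends up on top.
struct ServedLater {
    bool operator()(const Event& a, const Event& b) const {
        if (a.cycle != b.cycle) return a.cycle > b.cycle;
        return a.requestId > b.requestId;
    }
};

class EventQueue {
public:
    // Register an event issued in the given cycle; the request ID is assigned here.
    void Push(const std::string& name, std::uint32_t cycle) {
        queue_.push(Event{name, cycle, nextId_++});
    }
    bool Empty() const { return queue_.empty(); }
    // Remove and return the event that must be interpreted next.
    Event Pop() {
        Event e = queue_.top();
        queue_.pop();
        return e;
    }

private:
    std::priority_queue<Event, std::vector<Event>, ServedLater> queue_;
    std::uint32_t nextId_ = 0;
};

// Pop every event and return "name,cycle" strings in the order of execution.
std::vector<std::string> DrainOrder(EventQueue& q) {
    std::vector<std::string> order;
    while (!q.Empty()) {
        const Event e = q.Pop();
        order.push_back(e.name + "," + std::to_string(e.cycle));
    }
    return order;
}
```

With the events labelled in Fig. 11 pushed in the order (EvHwModule,6), (EvSwObject,5), (EvHwObject,5), (EvHwModule,4), this sketch pops the cycle-4 event first, then the two cycle-5 events in their issue order, and the cycle-6 event last. A non-preemptive dispatcher would simply loop over Pop and run each handler to completion.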


Signal: An application with multiple objects that work in parallel requires a method for synchronization. For example, with such a method, the application can activate an object after the task of another object is completed. For that purpose, the ObjectManager provides an event flag to change the status, and signals which send an event to other objects. With this method, the dependence between objects is expressed by signals, and the cooperation of multiple objects is realized. At the same time, the signals realize asynchronous access to the HwModule.

Synchronization: Figure 12 shows "Operate", a member function of the ViewObject (derived from SwObject) which controls two HwObjects named CamObject (pointed to by camObj) and WtObject (wtObj). These objects are described in the next section. Operate synchronizes the CamObject and the WtObject with the variable nSeq, in the following sequence:

1) As shown in Fig. 12, the ViewObject calls "camObj->ReqCapture()", which requests the CamObject to capture an image from the CMOS camera. Then, as shown in Fig. 11, the CamObject requests the HwNetDriver to start the HwNet. The HwNetDriver issues an event (EvHwModule,4) to let the HwManager access the HwModule asynchronously, and pushes it to the PriorityQueue of the ObjectManager. Note that the number 4 of this event denotes the cycle in which the event is issued, and this number also gives the priority of the event. During this process, the state of the ViewObject changes to the requesting state. Then the cycle of the application advances. The processing time of this step, i.e., the overhead, was measured as 38 µs.
2) The above event is then popped and interpreted by the ObjectManager. The HwNet of the CamObject starts, and the state of the CamObject changes to the processing state, as shown in Fig. 11. The overhead of the event execution is 5 µs.
3) As shown in Fig. 12, after the call of "camObj->ReqCapture()", the ViewObject calls "camObj->PostSignal( mSigP1(Operate,1) )", which posts a signal object to the CamObject; the signal is stored by the CamObject while the state of the CamObject is unchanged. When the HwNet completes the process, an asynchronous request is sent from the HwModule to the host through the HwModuleDriver attached to the OS. Then the HwManager issues an event (EvHwModule,6) and pushes it to the PriorityQueue of the ObjectManager. When this event is popped and interpreted, the HwManager gets the state of the HwNet by calling the HwNetDriver procedure, and the HwNetDriver changes the state of the CamObject to the completion state. Triggered by this change of state, the signal object stored in the CamObject is interpreted, and it calls ViewObject::Operate with nSeq=1. The overhead of this step is measured as 20 µs.
4) The processing time of the above sequence is measured as 39 ms. Since the HwNet of the CamObject completes the task in 33 ms as shown in Fig. 15(a), the OS consumes 6 ms to manage the other processes. Therefore the overhead time of 38 + 5 + 20 = 63 µs is negligible.

4.3 Detailed description of the HwManager

The HwManager is an object that acts like a BIOS for the ObjectManager: it manages multiple HwModules through the HwModuleDriver and provides a method for accessing the LMs and the HwNets from the host. It performs
1) the management of hardware resources such as LMs and FPGAs;
2) the loading and unloading of FPGAs;
3) the supply of the access method to LMs and HwNets to the HwObjects; and
4) the call of the appropriate procedures of the HwNetDriver when an asynchronous request from the HwModule is received.

It has a management table which connects the HwObjects, the HwNets, and the HwNetDrivers. Moreover, the HwManager coordinates the accesses to one HwModule from multiple HwObjects. The management information is updated periodically by the periodic events from the HwManager, in addition to the ordinary events from the application and the asynchronous requests from the HwModule. Like the ObjectManager, the HwManager is generated automatically as an object when the application is started.

Creation and Destruction of the HwObject: When the application constructs the HwObjects, the HwManager downloads the HwNets to the specified FPGA by sending a command to the HwModule. The HwManager also destroys the HwObjects in a similar way.

5. Application

Conventional IP’s (Intellectual Property) are intended to work in synchronization with circuits integrated in ASIC chips. On the contrary, the HwNet is enfolded into a capsule of the HwObject, and it works within an application program like SwObject. In this section, we present how an application is implemented with HwObjects and SwObjects, where the original application was programmed with only SwObjects to study the algorithm for the multi-level matching for detecting faces[16]. Then, we explain how the application works, and we list up strong and weak points of the HwObject from the programmer’s view point. Lastly we describe whether the HwObject is better or worse to the hardware designers who provide the HwObject libraries. 5.1 Implementation from software to hardware First, we designed an application using only the SwObjects which searches the target image from the input image using the template-matching algorithm. When

IEICE TRANS. FUNDAMENTALS, VOL.E87–A, NO.3 MARCH 2004

554

[Figure 14: block diagram of the CamObject, WtObject, ViewObject, MatchingObject, and FaceObject, connected by data-flow and control arrows, with the Source Image, the Transformed Image (Scale & Wavelet), the Face Template, and the Virtual Image.]

Fig. 14 Top level view of the data flow of the application which performs target tracing by the wavelet-template matching.

[Figure 13: (a) main flow, inImage = camObject->getView(); wtImage = wtObject->Trans(); while (value > threshold) { value = faceObj->matching(level); }, annotated with the read, wavelet transform, and template matching times for image sizes from 640×480 down to 80×60 pixels (matching time: 6 seconds at 320×240, 1 second at 160×120); (b) pseudo-code of FaceObject::matching(level), which gets the focus domain of the ViewObject at the given level, fits the average brightness, takes the minimum over the block searches, compares it with a threshold, and calls matching(level - 1) recursively; (c) pseudo-code of the main loop, which evaluates templateMatching for the difference Diff(x, y) = template - inImage(x, y), moves to the new position (newX, newY), and breaks when the position no longer changes.]

Fig. 13 (a) Main flow of the MatchingObject in the application with three SwObjects. The performance of each SwObject is obtained under the condition of a 1.2 GHz Athlon with 768 MB main memory. (b) Pseudo-code of FaceObject::matching( ). The matching is called recursively with respect to the wavelet level. (c) Pseudo-code of the main loop for the template matching. The templateMatching is done at each displacement of x and y step by step.
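The recursion over wavelet levels in Fig. 13(b) and the displacement search in Fig. 13(c) can be illustrated with a minimal C++ sketch. It is not the authors' code: the `Image` type, the 2×2 block average used in place of the low-frequency wavelet band, the sum-of-absolute-differences measure, the ±1 pixel refinement window, and all function names are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical grayscale image type (not taken from the paper).
struct Image {
    int w, h;
    std::vector<int> px;
    Image(int w_, int h_) : w(w_), h(h_), px(static_cast<std::size_t>(w_) * h_, 0) {}
    int& at(int x, int y) { return px[static_cast<std::size_t>(y) * w + x]; }
    int at(int x, int y) const { return px[static_cast<std::size_t>(y) * w + x]; }
};

// Assumption: a 2x2 block average stands in for the low-frequency wavelet band.
Image halve(const Image& in) {
    Image out(in.w / 2, in.h / 2);
    for (int y = 0; y < out.h; ++y)
        for (int x = 0; x < out.w; ++x)
            out.at(x, y) = (in.at(2 * x, 2 * y) + in.at(2 * x + 1, 2 * y) +
                            in.at(2 * x, 2 * y + 1) + in.at(2 * x + 1, 2 * y + 1)) / 4;
    return out;
}

// Matching difference: sum of absolute differences at offset (ox, oy).
long diff(const Image& img, const Image& tpl, int ox, int oy) {
    long s = 0;
    for (int y = 0; y < tpl.h; ++y)
        for (int x = 0; x < tpl.w; ++x)
            s += std::labs(static_cast<long>(img.at(ox + x, oy + y) - tpl.at(x, y)));
    return s;
}

// Search all offsets within `radius` of `guess`, clipped to valid positions.
std::pair<int, int> blockSearch(const Image& img, const Image& tpl,
                                std::pair<int, int> guess, int radius) {
    long best = std::numeric_limits<long>::max();
    std::pair<int, int> pos = guess;
    const int maxX = img.w - tpl.w, maxY = img.h - tpl.h;
    for (int y = std::max(0, guess.second - radius); y <= std::min(maxY, guess.second + radius); ++y)
        for (int x = std::max(0, guess.first - radius); x <= std::min(maxX, guess.first + radius); ++x) {
            const long d = diff(img, tpl, x, y);
            if (d < best) { best = d; pos = std::make_pair(x, y); }
        }
    return pos;
}

// Recursive matching: search at `level`, then refine at the next finer level.
// imgs[0] and tpls[0] are full resolution; higher indices are coarser.
std::pair<int, int> matching(const std::vector<Image>& imgs, const std::vector<Image>& tpls,
                             int level, std::pair<int, int> guess, int radius) {
    const std::pair<int, int> pos = blockSearch(imgs[level], tpls[level], guess, radius);
    if (level == 0) return pos;
    return matching(imgs, tpls, level - 1, std::make_pair(2 * pos.first, 2 * pos.second), 1);
}

// Demo: hide an 8x8 gradient patch at (12, 20) in a 32x32 image and find it over 3 levels.
std::pair<int, int> demoMatch() {
    Image img(32, 32), tpl(8, 8);
    for (int y = 0; y < 8; ++y)
        for (int x = 0; x < 8; ++x) {
            tpl.at(x, y) = 50 + 10 * x + 3 * y;
            img.at(12 + x, 20 + y) = tpl.at(x, y);
        }
    std::vector<Image> imgs(1, img), tpls(1, tpl);
    for (int l = 1; l < 3; ++l) {
        imgs.push_back(halve(imgs.back()));
        tpls.push_back(halve(tpls.back()));
    }
    // The radius of 32 makes the search at the coarsest level exhaustive.
    return matching(imgs, tpls, 2, std::make_pair(0, 0), 32);
}
```

Only the coarsest level is searched exhaustively; each finer level examines at most nine offsets around the doubled coarse position, which is why the multi-level scheme is much cheaper than a full-resolution search.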

the algorithm is fixed through the design steps of verification and testing, we examine the bottleneck of the application. Figure 13(a) shows the main SwObjects of the application, namely, the CamObject, the WtObject, and the FaceObject, with a timing profile measured with a 1.2 GHz CPU and 768 MB of main memory. A pseudo-code of the matching function of the FaceObject is shown in Fig. 13(b), and the main loop of the template matching, where templateMatching calculates the matching difference Diff(x,y), is shown in Fig. 13(c). The execution time of each SwObject is measured as a linear function of the number of input image pixels for the image input, a quadratic function for the wavelet transform, and a third-order function for the template matching. Even in the 160×120 pixel case, the matching time of 1 second is not acceptable. The execution time of 6 seconds far exceeds the limit of our specification for the real-time matching system. Then, the application is reformed to acquire the images from a CMOS camera which is connected to the GPIF of the HwModule, apply the wavelet transformation to the input image, perform the template matching between the transformed image and the face templates which are prepared on the host computer, and construct a virtual image on the host using the matching result. The application uses HwNets for the image acquisition and the wavelet transformation, and performs them in parallel. Note that the HwNet for template matching is not used in this application, and only the low frequency components of the wavelet transformations, namely, $C^{n}_{ij}$ in Fig. 7, are used. Figure 14 shows the flow of this application, which traces the face target in the source image. Five objects, namely, the CamObject, WtObject, ViewObject, MatchingObject, and FaceObject, are defined for this task.

5.2 Cooperations of the objects

In the following, the roles of these objects are explained by following the flow of the process shown in Fig. 15. The CamObject is a HwObject which acquires an image from the CMOS camera in 33 ms (30 frames/sec) and writes the input image on LM1. When the CamObject is constructed, the HwNet for controlling the camera is downloaded to FPGA1 in the HwModule in 50 ms. The WtObject is a user class derived from the base class of the HwObject as shown in Fig. 16(a). The WtObject performs the wavelet transformation, and its HwNet is downloaded to FPGA2 by its constructor call as shown in Figs. 15(b) and 16(b). It reads the input image from LM1 and writes the transformed image on LM2, and it takes 18 ms for 3 level transformations


[Figure 15(a): block diagram of the data flow in the HwModule. Blocks and labels: CMOS image sensor; FPGA1, image capture HwNet, 33 ms (30 frames/sec); FPGA2, wavelet transform HwNet, 18 ms (all 3 levels); FPGA3, template matching systolic array; LM1, LM2, and LM3 holding image data, transformed data, and the template; 1 frame: 640×480; PCI IF, PCI-bus: 25 frames/sec; host processor, multi-level face detecting and matching results, 40 ms (25 frames/sec).]

[Figure 15(b): execution profile. Application starts; constructors of the CamObject and the WtObject: load the cam- and wt-HwNets (50 ms each) and attach the HwNet drivers; other functions and operations: image capture 33 ms (30 frames/sec), wavelet transform 18 ms (all 3 levels), PCI-bus 25 frames/sec, template matching calculation 40 ms (25 frames/sec), PCI-bus 25 frames/sec; destructor of the HwObjects: unload the HwNets and detach the HwNet drivers.]

Fig. 15 (a) The data flow and their processing times in the HwModule. (b) Execution profile of the application with the HwObjects.

at once. The LM buses for LM1 and LM2 are independent; thus the load, the transformation, and the store of the image are performed in parallel, namely, macropipelined operations are realized. After the completion of the wavelet transformation, the event flag is set to the WtObject. The ViewObject calls the two constructors of the CamObject and the WtObject and assigns their locations as shown in Fig. 16(b). The ViewObject reads the transformed image after receiving the signal from the WtObject by the statement "ViewObject->Operate" as shown in Fig. 16(c). The input image is transmitted via the PCI-bus at the rate of 25 frames/sec as shown in Fig. 15(b). Then the ViewObject passes it to the MatchingObject with a request for further processing. The FaceObject is a SwObject which has face templates for multi-level matching. The FaceObject receives the position of the template from the MatchingObject, and then it draws the template image on the input image as shown in Fig. 17. The MatchingObject is a SwObject which includes the wavelet-transformed FaceObjects. It performs the multi-level template matching after fitting their brightness to that of the matching area in the input images.

(a)
    class WtObject : public HwObject {
    public:
        WtObject( void );
        ~WtObject();
        void TransMode( ETransMode etm );
        void Input( const void* cpvImg );
        void Output( void* pvImg, EOutMode eom );
        void ReqTransform( void );
    protected:
        void InitHwNetLink( RHNL rhnl );
        void Completed( void );
        void Enabled( void );
    };

(b)
    ViewObject::ViewObject( void ) {
        wtObj = new WtObject;
        hnv.hwModule = 1; hnv.progDev = 2;
        wtObj->RefHwNetLink().dVector = hnv;

        camObj = new CamObject( wtObj );
        hnv.hwModule = 1; hnv.progDev = 1;
        camObj->RefHwNetLink().dVector = hnv;
    }

(c)
    if( !viewObj ) {
        viewObj = new ViewObject();
        viewObj->setImage( curSheet->image );
    } else {
        viewObj->wtObj->TransMode( WtObject::etmL3LLLL );
        viewObj->Operate();
    }
    inBitmap = viewObj->getImage()->Picture->Bitmap;

Fig. 16 Source code examples. (a) Header file of the WtObject. (b) Constructor of the ViewObject which holds input images from the CamObject. (c) A part of the member function of the MatchingObject.
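To make the usage pattern of Fig. 16 concrete without the HwModule, the following is a minimal software mock of a user class derived from a HwObject-like base class. Everything here is hypothetical: `HwObjectBase`, `MockWtObject`, `IsCompleted()`, and `ResetCompleted()` are names invented for this sketch, the member signatures differ from those in Fig. 16(a), and a plain loop computing 2×2 block averages stands in for the wavelet HwNet on the FPGA.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the HwObject base class; the real library is not reproduced here.
class HwObjectBase {
public:
    virtual ~HwObjectBase() {}
    bool IsCompleted() const { return completed_; }
protected:
    // Assumption: the real driver would invoke this hook on the HwNet completion event.
    virtual void Completed() { completed_ = true; }
    void ResetCompleted() { completed_ = false; }
private:
    bool completed_ = false;
};

// Software mock of a wavelet object: a plain loop takes the place of the HwNet.
class MockWtObject : public HwObjectBase {
public:
    // Copy the input image (row-major, w x h) into the mock "local memory".
    void Input(const std::vector<int>& img, int w, int h) {
        in_ = img;
        w_ = w;
        h_ = h;
        ResetCompleted();
    }
    // Compute one low-frequency level as 2x2 block averages, then signal completion.
    void ReqTransform() {
        const int ow = w_ / 2, oh = h_ / 2;
        out_.assign(static_cast<std::size_t>(ow) * oh, 0);
        for (int y = 0; y < oh; ++y)
            for (int x = 0; x < ow; ++x) {
                const int sum = in_[(2 * y) * w_ + 2 * x] + in_[(2 * y) * w_ + 2 * x + 1] +
                                in_[(2 * y + 1) * w_ + 2 * x] + in_[(2 * y + 1) * w_ + 2 * x + 1];
                out_[static_cast<std::size_t>(y) * ow + x] = sum / 4;
            }
        Completed();
    }
    // Read back the transformed image from the mock "local memory".
    const std::vector<int>& Output() const { return out_; }
private:
    std::vector<int> in_, out_;
    int w_ = 0, h_ = 0;
};
```

A caller writes the input, requests the transform, and reads the output once the completion flag is set, in the same order as the public functions of the WtObject; the protected completion hook stays hidden from the application code.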

This process consumes more than 6 seconds in the case of 320×240 image pixels. If the FaceObject is replaced with the HwObject with the template matching circuit presented in Sec. 3, the execution time will be reduced to 40 ms. As the result of the matching, it outputs the best position of the FaceObject, where the matching measure between the face template and the input image is minimized. The HwManager (with the HwNetDriver) registers the asynchronous request from the HwModule as an event with the HwNet-ID. When the MatchingObject sends a request event for the image acquisition to the ViewObject and the CamObject, the HwNetDriver sets the event flag to command the CamObject to start the data processing. The CamObject stores one frame of a 640×480 gray-scale image to LM1, and it sets the completion status flag which is watched by the MPU on the HwModule. At the same time, the ViewObject also sets the signal which requests the WtObject to start the transform processing when receiving the execute status flag generated from the completion status flag from the CamObject by the MPU. The ObjectManager schedules the completion event of the MatchingObject directed to the FaceObject. It is issued by the HwManager when the status of the HwNet for the wavelet transform is received. At the scheduled time, the FaceObject receives the completion event from the MatchingObject. The MatchingObject sends an event for the image acquisition to the ViewObject again, and the above processes are repeated. As a whole, two HwObjects, namely, the CamObject and the WtObject, cooperate and perform the image acquisition, the wavelet transformation, and the image output like the usual stream processing in digital signal processing. On the other hand, on the host, the objects process their tasks in a handshake manner, and the application continues to generate the input image.

Fig. 17 The result of matching.

5.3 Properties from the software designer's view

It is preferable to separate an application development into design stages. In the first design stage, a software designer focuses on the algorithm, data structures, exceptions, debugging errors, extendibility/flexibility of the application, and so on. Small but varied test samples are better for these design steps. After the algorithm and data structures are fixed, the application is refined and tested with real, large data sets. It is highly desirable that no part of the application has to be changed to obtain superior performance. The replacement of the SwObject with the HwObject easily gives a final design. Performance: The HwObjects eliminate the bottleneck SwObjects. The HwNets run in parallel with each other. The separated local buses enable macropipeline operations and also parallel operations. Since the HwNets are not connected to the other ones directly, there is some overhead caused by the PCI-bus data transfer and the control operations of the ObjectManager and the HwManager. Therefore we must take

out the HwNet from the application in consideration of the data transfer and the control sequence. Rapid prototyping and seamlessness: The commercial compiler C++Builder with the HwObject library is used to program the application with its integrated design environment (IDE). We can easily make our HwObjects derived from the HwObject base class, step by step and seamlessly, by adding new member functions. The software designer is free from the traditional design work related to the hardware, such as core circuits, peripheral circuits, and drivers. He/she can trace the whole data flow and control seamlessly without considering hardware operations in detail. Robustness: Since the ObjectManager and the HwManager run as a part of the application with the HwModuleDriver and the HwNetDriver, the whole PC system does not freeze; only the application crashes when the HwNet goes wrong. Therefore the turnaround time of the debugging becomes short, like the debugging of software source code. Object: The HwObject supplied by the hardware designers is entirely the same as the SwObject, which includes a property-list, a method-list, and an event-list. Since the preliminary member functions which control the HwNets and their protocols are supplied as protected functions, the software designer can concentrate on the public functions, for example, the simple local memory access functions transMode(), input(), output(), and reqTrans() of the WtObject class as shown in Fig. 16(a). Weak point: The software designer has to be somewhat familiar with the hardware operation in order to obtain better efficiency. For example, he/she has to know the data transfer rate of the PCI-bus, data locations, SIMD or MIMD operations, and the data flows among the host processor and the HwNets. We have not implemented a direct access method to the HwNet yet.
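The point about knowing the transfer rates can be checked with simple arithmetic. The sketch below is a back-of-the-envelope model, not a measurement: it assumes that the stages of the macropipeline overlap completely, so the sustained frame period equals the slowest stage while the latency of one frame is the sum of the stages. The function names are hypothetical; the stage times are those quoted in Fig. 15 (capture 33 ms, wavelet transform 18 ms, PCI-bus transfer at 25 frames/sec, i.e. 40 ms).

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Steady-state frame period of a fully overlapped pipeline: the slowest stage.
// Precondition: stageMs is not empty.
double framePeriodMs(const std::vector<double>& stageMs) {
    return *std::max_element(stageMs.begin(), stageMs.end());
}

// Latency of a single frame: the sum of all stage times.
double latencyMs(const std::vector<double>& stageMs) {
    return std::accumulate(stageMs.begin(), stageMs.end(), 0.0);
}

// Sustained frame rate in frames per second.
double framesPerSecond(const std::vector<double>& stageMs) {
    return 1000.0 / framePeriodMs(stageMs);
}

// Stage times quoted in Fig. 15: capture, wavelet transform, PCI-bus transfer.
std::vector<double> fig15Stages() {
    return std::vector<double>{33.0, 18.0, 40.0};
}
```

Under these assumptions the PCI-bus transfer, not the capture or the wavelet HwNet, bounds the rate at 25 frames/sec, which is consistent with the advice to weigh data transfer when deciding which part of an application to move into a HwNet.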
5.4 Properties from the hardware designer's view

The HwObject is not a conventional IP, but a virtual circuit effectively working in the program. The HwObject is temporal, reconfigurable, and evolutional, though we do not represent these features explicitly. It is still an open problem for the hardware designers what kinds of circuits have to be prepared. IP provider: The HwObject model presents a standard library which is independent of the OS, the FPGA, the PCI-bus protocol, and the board structure, because we separate the HwNet from the HwModule, which has all the hardware information in the HwModuleDriver. This separation makes the HwNet a virtual circuit with the HwNetDriver. Therefore IP providers will circulate resynthesizable HDL designs as the HwObject library, and the system designers use


them like the object component library attached to the compiler. Rapid Design: We can easily make a test-bed application for verifying the functions of a target design by making the HwObject. If the HwNet of the test bed is loaded to provide the signals to the target HwNet, the timing design step will proceed seamlessly. HwNet debug: The microprocessor is used for watching the HwNet operation, and the hardware designer can utilize it for debugging the HwNets. We use it to watch the HwNets during our HwNet designs. Top down design: C++ specification designs are translated to RTL designs by high-level synthesis, and then a logic synthesis program generates loadable bit-stream images from them. HwObject cooperation: Since multiple HwObjects are virtually connected via memory accesses, various connections such as serial, parallel, and pipeline connections are realized by describing allocation statements in the constructor of the HwObject. To reduce the overhead of memory access by the HwNet, the microprocessor on the HwModule controls the HwNets directly via the control bus without interfering with the host processor. However, we need much more experience to reduce the area and timing overhead of the HwNet cooperation. HwNet access: Another overhead exists in the access from the member functions to the HwNets via the PCI-bus and the LMs. The current PCI-bus is too slow and narrow for the HwObject, but it will be improved if the HwObjects are used widely. Thus it is not a technical but a commercial problem. Currently it is better to avoid frequent and small data transfers. ObjectManager: From our experience, the grain sizes of the circuits used in the program are not small, and the operations and the data accesses occur locally on the HwModule. Also, the event handling by the ObjectManager is not an overhead peculiar to the HwObjects, because the event, i.e., the message, is commonly used in object-oriented programming. Constraint: Two data ports, two address ports, and several control ports are necessary. This constraint reduces the HwNet flexibility. However, we can design various HwNets because the other constraints are negligible. Standardization: If the HwModule and the interface protocol are standardized, the HwNet will be free from the FPGA structure, the OS, and the compiler. Also, the HwNetDriver will be designed easily.

6. Conclusions and discussions

We implemented a platform named the hardware module (HwModule) to execute a physical circuit in software as a hardware object (HwObject). The HwModule is attached to the PCI bus of a standard Windows PC, and has three FPGAs for the applications. Moreover, the object manager (ObjectManager) to access the HwModule from the host computer is also implemented. With the ObjectManager, the circuit (HwNet) in the HwModule is easily accessed from a standard C++ compiler and can be used as an object in the application. As examples of the HwNets, we designed three circuits, namely, the control circuit for the camera, the wavelet transformation circuit, and the template matching circuit. To show the efficiency of our HwModule, an application for the multi-level image matching which uses HwObjects and SwObjects is introduced. This application acquires images from a CMOS camera which is connected to the GPIF of the HwModule, applies the wavelet transformation to the input image, performs the template matching between the transformed image and the face templates which are prepared on the host computer, and constructs a virtual image on the host using the matching result. In the present paper, we did not use the template matching circuit, and used a software object for the matching. If the template matching circuit is used, the real-time tracing of the target would be realized by the repetitive template matching. We found that the separation into the HwModule device driver and the HwNet device driver is essential because the HwNet device driver must be attached dynamically when the specific HwNet is required. Moreover, the ObjectManager is built on many methodologies and techniques related to the OS, the compiler, and the hardware. If they are built up as an infrastructure, then the HwObjects, i.e., the HwNet libraries and also IPs, become easy to use without complicated and tedious skills. On the other hand, multi-lateral accesses or controls among SwObjects and HwObjects are also easily realized as shown in the application example. However, designing the control sequence of the application is relatively difficult because the sequential control of the software and the parallel execution of the hardware conflict with each other. Many more designs on heterogeneous hardware/software systems will be necessary to establish the feasibility of the HwObject. We believe that the HwObjects are still effective even in the fields of complex and large-scale data manipulation and of digital signal processing. For more complex examples, the database of hierarchically characterized templates could be generated by the self-organizing map (SOM)[17], which divides the input picture space into subsets, and divides each subset into further subsets. The SOM learns the distance of the templates and forms the topological map of templates. Then, this map performs clustering. We are now examining the design of a HwNet for the 2-dimensional array structure of the SOM. With this learned database, the preceding application, which searches the target in the database, performs the pattern recognition of the target. Then, the virtual image of the application could be used as a space for the image


recognition. Currently, the load on the microprocessor grows heavier day by day. On the other hand, ASIC chips become successively larger, and this looks inevitable. The HwObject may provide a third way to compensate for their shortcomings, such as the power consumption, testability, or the design and fabrication costs. From the view of the application developer, the HwObject supplies us with the same methodology as object-oriented programming, but the circuits (HwNets) on the HwModule must be written in the standard hardware-description languages. Recently, the description of circuits with the C language from the behavior level has been attracting considerable attention[18], [19]. When such methods become mature, HwNets could be described with the C language, and the development of applications for the HwModule might become much easier for software programmers.

Acknowledgement

This work is supported by VLSI Design and Education Center (VDEC), the University of Tokyo with the collaboration with SYNOPSYS Corporation.

References

[1] R. Tessier and W. Burleson, “Reconfigurable computing for digital signal processing: a survey,” Journal of VLSI Signal Processing, Vol.28, pp.7–27, 2001.
[2] Y. Sueyoshi and M. Iida, “Configurable and reconfigurable computing for digital signal processing,” IEICE Trans. Fundamentals, Vol.E85-A, no.3, pp.591–599, 2002.
[3] H. Amano, Y. Shibata, and M. Uno, “Reconfigurable systems: new activities in Asia,” Proc. FPL2000, pp.585–594, 2000.
[4] E. Caspi, M. Chu, R. Huang, J. Yeh, Y. Markovskiy, A. DeHon, and J. Wawrzynek, “Stream computations organized for reconfigurable execution (SCORE): introduction and tutorial,” http://www.cs.berkeley.edu/projects/brass/documents/score tutorial.html.
[5] J.R. Hauser and J. Wawrzynek, “Garp: a MIPS processor with a reconfigurable coprocessor,” in Proceedings of IEEE Workshop on FPGAs for Custom Computing Machines, pp.24–33, 1997.
[6] D.A. Buell, J.M. Arnold, and W.J. Kleinfelder, “Splash 2: FPGAs in a Custom Computing Machine,” Wiley-IEEE Press, 1996.
[7] J.E. Vuillemin, P. Bertin, D. Roncin, M. Shand, H.H. Touati, and P. Boucard, “Programming active memories: reconfigurable systems come of age,” IEEE Trans. on VLSI, pp.56–69, 1996.
[8] P. Bellows and B. Hutchings, “JHDL - an HDL for reconfigurable systems,” in Proceedings of IEEE Workshop on FPGAs for Custom Computing Machines, pp.175–184, 1998.
[9] M. Chu, N. Weaver, K. Sulimma, A. DeHon, and J. Wawrzynek, “Object oriented circuit-generators in Java,” in Proceedings of IEEE Workshop on FPGAs for Custom Computing Machines, pp.158–166, 1998.
[10] D. Davis, M. Barr, T. Bennett, S. Edwards, J. Harris, I. Miller, and C. Schanck, “A Java development and runtime environment for reconfigurable computing,” in Proceedings of 5th Reconfigurable Architectures Workshop, pp.43–48, 1998.
[11] M. Sekine, T. Kanamaru, K. Kudo, H. Imanaka, Y. Shiga, H. Ito, and Y. Myokan, “Hardware objects of the circuits for robotics,” in Proceedings of 2003 IEEE International Symposium on Computational Intelligence in Robotics and Automation, pp.1421–1426, 2003.
[12] JTAG is an abbreviation for joint test action group. The standard test access port and boundary scan architecture was standardized as IEEE 1149.1 in 1990.
[13] Opencores web site, http://www.opencores.org/.
[14] S. Mallat, “A wavelet tour of signal processing,” Academic Press, 1999.
[15] T. Yahagi et al., “VLSI and Signal Processing,” Corona Pub. Co., Ltd., pp.202–205, 1997 (in Japanese).
[16] M. Sekine, T. Kanamaru, and H. Ito, “Multi-level matching for detecting faces,” IEICE Trans. on Fundamentals (Japanese edition), vol.J86-A, no.9, pp.969–973, 2003.
[17] T. Kohonen, “Self-organizing maps,” Springer-Verlag, 2000.
[18] SpecC web site, http://www.specc.org/.
[19] SystemC web site, http://www.systemc.org/.
