
JOURNAL OF MULTIMEDIA, VOL. 2, NO. 5, SEPTEMBER 2007

FPGA-based Real-time Optical Flow Algorithm Design and Implementation

Zhaoyi Wei, Dah-Jye Lee and Brent E. Nelson
Department of Electrical and Computer Engineering, Brigham Young University, Provo, Utah, 84602 USA
Emails: [email protected], [email protected], [email protected]

Abstract—Optical flow algorithms are difficult to apply to robotic vision applications in practice because of their extremely high computational and frame rate requirements. In most cases, traditional general purpose processors and sequentially executed software cannot compute optical flow in real time. In this paper, a tensor-based optical flow algorithm is developed and implemented using field programmable gate array (FPGA) technology. The resulting algorithm is significantly more accurate than previously published FPGA results and was specifically developed to be implemented using a pipelined hardware structure. The design can process 640×480 images at 64 fps, which is fast enough for most real-time robot navigation applications. The design also has low resource requirements, making it easier to fit into small embedded systems. Error analysis on a synthetic image sequence is given to show its effectiveness. The algorithm is also tested on a real image sequence to show its robustness and limitations. These limitations are analyzed and an improved scheme is then proposed. It is then shown that the performance of the design could be substantially improved with sufficient hardware resources.

Index Terms—Optical flow, FPGA, Motion estimation

I. INTRODUCTION

In this paper we propose a modified optical flow algorithm that is suitable for FPGA implementation and which can be used in real-time autonomous navigation, moving object detection, and other applications that require real-time, accurate motion field estimation. Optical flow aims to measure the motion field from the apparent motion of the brightness pattern in an image sequence. Optical flow is one of the most important descriptions of an image sequence and is widely used in 3D vision tasks such as motion estimation and structure from motion (SfM). The basic assumption of optical flow algorithms is the brightness constancy constraint, which assumes that image brightness changes between frames are due only to camera or object motion. In other words, if the interval between frames is small, other effects causing brightness changes (such as changes in lighting conditions) can be neglected.

However, the processing time of existing optical flow algorithms is usually on the order of seconds or tens of seconds per frame using general purpose processors. This long processing time prevents optical flow algorithms

from being used for most real-time applications such as autonomous navigation for unmanned vehicles. In recent years, a number of different schemes have been proposed to implement optical flow algorithms in real time. The basic idea behind them is to use pipelining and/or parallel processing architectures to perform the computation. Using a pipeline image processor, Correia and Campilho [1] proposed a design which can process the Yosemite sequence of 252×316 images in 47.8 ms. FPGAs have also been used to process larger images at faster speeds [2]–[6] because of their configuration flexibility and high data processing speed. An algorithm proposed by Horn and Schunck [7] was implemented in [2] and [3]. It is an iterative algorithm whose accuracy depends largely on the number of iterations. The classical Lucas and Kanade approach [8] was also implemented [4] for its good tradeoff between accuracy and processing efficiency.

Many optical flow algorithms have been developed in the last two decades. During this time, 3D tensor techniques have shown their superiority in producing dense and accurate optical flow fields [9]–[14]. The 3D tensor provides a powerful closed representation of the local brightness structure. For example, in [10] an accurate and fast tensor-based optical flow algorithm was modified for FPGA hardware implementation. The hardware architecture of this implementation was adjusted to reach a desired tradeoff of accuracy for resource utilization. In the work described in [21], a tensor-based optical flow algorithm was proposed and implemented in an FPGA. The resulting design was accurate and well suited for pipelined hardware implementation. It is able to process 640×480 images at 64 frames per second or faster, which is adequate for most real-time navigation applications.
To the best of our knowledge, the design described in [21] achieves a 30% accuracy improvement on the Yosemite image sequence and executes faster (19,661 K pixels per second) than any previous work. It is also an efficient design that uses fewer FPGA resources than previously reported work. Potential applications of this design include real-time navigation and obstacle avoidance for unmanned autonomous vehicles (UAVs) and other applications which require real-time optical flow computation. In this paper, the hardware structure of that design is described in more detail, additional analysis of


its performance is provided for real image sequences, and the algorithm is extended to increase its robustness in the presence of significant platform jitter and image noise.

This paper is organized as follows. In Section II, the algorithm is formulated and our modifications are introduced. In Section III, the hardware implementation and the tradeoffs made in the design are discussed. In Section IV, the current hardware and a new hardware platform are introduced. Performance analysis of the proposed design is given in Section V. Conclusions and future work are discussed in Section VI.

II. ALGORITHM

An image sequence $g(\mathbf{x})$ can be treated as volume data, where $\mathbf{x} = (x, y, t)^T$ is the 3D coordinate, $x$ and $y$ are the spatial components, and $t$ is the temporal component. According to the brightness constancy constraint, object movement in the spatio-temporal domain generates brightness patterns with certain orientations. The 3D tensor is a compact representation of local orientation. There are different types of 3D tensors based on different formulations [15], [16]. Unlike the polynomial tensor [12] used in [10], the gradient tensor is used in this design because it is easier to implement with pipelined hardware. For simple signals, the gradient tensor and polynomial tensor give similar estimates of the local orientation [15]. The outer product $O$ of the averaged gradient $\overline{\nabla g}(\mathbf{x})$ is defined as

$$O = \overline{\nabla g}(\mathbf{x})\,\overline{\nabla g}(\mathbf{x})^{T} = \begin{pmatrix} o_1 & o_4 & o_5 \\ o_4 & o_2 & o_6 \\ o_5 & o_6 & o_3 \end{pmatrix} \qquad (1)$$

where

$$\overline{\nabla g}(\mathbf{x}) = \sum_i w_i\,\nabla g(\mathbf{x}_i), \qquad \nabla g(\mathbf{x}) = \big(g_x(\mathbf{x})\ \ g_y(\mathbf{x})\ \ g_z(\mathbf{x})\big)^{T} \qquad (2)$$

and $w_i$ are weights for averaging the gradient. The gradient tensor $T$ can be constructed by weighting $O$ in a neighborhood as

$$T = \sum_i c_i O_i = \begin{pmatrix} t_1 & t_4 & t_5 \\ t_4 & t_2 & t_6 \\ t_5 & t_6 & t_3 \end{pmatrix}. \qquad (3)$$
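As a concrete software illustration of Equations (1)–(3), the tensor construction can be sketched in NumPy. This is a floating-point model, not the fixed-point hardware, and the kernels passed in are placeholders:

```python
import numpy as np

def conv2d_same(a, k):
    # Zero-padded, same-size weighted sum; for the symmetric Gaussian
    # kernels used here, correlation and convolution coincide.
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    ap = np.pad(a, ((ph, ph), (pw, pw)))
    out = np.zeros_like(a, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += k[i, j] * ap[i:i + a.shape[0], j:j + a.shape[1]]
    return out

def gradient_tensor(gx, gy, gt, w, c):
    """Per-pixel gradient tensor from gradient fields gx, gy, gt.

    w is the gradient-averaging kernel of Eq. (2) and c the tensor
    weighting kernel of Eq. (3).  Returns [t1, t2, t3, t4, t5, t6].
    """
    # Eq. (2): average each gradient component.
    gxb, gyb, gtb = (conv2d_same(g, w) for g in (gx, gy, gt))
    # Eq. (1): the six distinct entries of the outer product O.
    o = [gxb * gxb, gyb * gyb, gtb * gtb,
         gxb * gyb, gxb * gtb, gyb * gtb]
    # Eq. (3): weight O over a neighborhood to form the tensor T.
    return [conv2d_same(x, c) for x in o]
```

Because $O$ is built from products of averaged gradients, only its six distinct entries need to be stored, which is what the hardware data path exploits.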

$O$ is then smoothed to reduce the effect of noise and to decrease the singularity of the tensor matrix. The gradient tensor $T$ is a 3×3 positive semi-definite matrix. Optical flow $(v_x, v_y)^T$ is measured in pixels per frame and can be extended to a 3D spatio-temporal vector $\mathbf{v} = (v_x, v_y, 1)^T$. For an object with only translational movement and without noise in the neighborhood, $\mathbf{v}^T T \mathbf{v} = 0$. In the presence of noise and rotation, $\mathbf{v}^T T \mathbf{v}$ will not be zero. Instead, $\mathbf{v}$ can be determined by minimizing $\mathbf{v}^T T \mathbf{v}$. A cost function can be defined for this purpose as

$$e(\mathbf{v}) = \mathbf{v}^{T} T \mathbf{v}. \qquad (4)$$

The velocity vector $\mathbf{v}$ is the 3D spatio-temporal vector which minimizes the cost function at each pixel. The optical flow can be solved as

$$v_x = \frac{t_6 t_4 - t_5 t_2}{t_1 t_2 - t_4^2} \qquad (5)$$

$$v_y = \frac{t_5 t_4 - t_6 t_1}{t_1 t_2 - t_4^2}. \qquad (6)$$
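Setting the partial derivatives of $e(\mathbf{v})$ with respect to $v_x$ and $v_y$ to zero gives the 2×2 linear system $t_1 v_x + t_4 v_y + t_5 = 0$, $t_4 v_x + t_2 v_y + t_6 = 0$, whose closed-form solution is Equations (5) and (6). A per-pixel sketch (the zero fallback for a singular tensor is our choice, not specified in the paper):

```python
def flow_from_tensor(t1, t2, t3, t4, t5, t6, eps=1e-12):
    """Closed-form minimizer of e(v) = v^T T v over v = (vx, vy, 1)^T.

    t3 does not appear in Eqs. (5)-(6); it is kept in the signature
    only to mirror the six tensor components.
    """
    det = t1 * t2 - t4 ** 2
    if abs(det) < eps:
        # Singular (e.g. textureless) neighborhood: no reliable flow.
        return 0.0, 0.0
    vx = (t6 * t4 - t5 * t2) / det   # Eq. (5)
    vy = (t5 * t4 - t6 * t1) / det   # Eq. (6)
    return vx, vy
```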

The algorithm proposed in this paper is similar to the one in [10], which assumes constant motion; further details can be found in [10]. In [10], the affine motion model is used to incorporate the tensors in a neighborhood: pixels in a neighborhood are assumed to belong to the same motion model. To reduce hardware resources, the constant model is used in this design, which assumes the constancy of velocity in a local neighborhood. The constant model performs almost as well as the affine motion model when operating in a small neighborhood.

III. DESIGN

A. Data Flow

The data flow of the proposed algorithm is divided into five modules as shown in Figure 1. In the Gradient Calculation module, images are read from memory into the FPGA and aligned to calculate the gradient. The gradient components are then averaged in the Gradient Weighting module (Equation (2)). From the outer product of the averaged gradient components (Equation (1)), five outer product components can be obtained. These components are then further averaged to compute the gradient tensor (Equation (3)). Finally, the tensor components are fed into the Optical Flow Calculation module to compute the final result according to Equations (5) and (6), which is then written back to memory. Importantly, there is no iterative processing in this design; the computation can therefore be fully pipelined in hardware to improve processing throughput and thereby enable real-time use.

In the block diagram of Figure 1, the connections between modules consist only of unidirectional data and a corresponding data-valid signal. Once a set of data is generated in a module, it is registered into the downstream module for processing. At the end of the pipeline, the results are written back to memory for further processing, display, or analysis.

B.
Tradeoff Between Accuracy and Efficiency

Three modules in the computation are convolution-like operations: Gradient Calculation, Gradient Weighting, and Tensor Calculation. These three modules are critical to the algorithm's performance and use the majority of the hardware resources in the design. Therefore, different configurations of these modules were evaluated and compared to reach an optimal solution.


Figure 1 Data flow of the design (signals with an underscore are the weighted versions of those without)

The Gradient Calculation module is a 1D convolution process, while the Gradient Weighting and Tensor Calculation modules are both 2D convolution processes. Generally, a larger convolution kernel provides better accuracy but requires more hardware resources. Finding the optimal tradeoff between accuracy and efficiency means finding the convolution kernel sizes with which the design meets its performance requirements at minimum hardware cost. Once the kernel sizes are decided, the corresponding optimal weights can be specified.

The design was simulated in Matlab, with the simulations designed to match the hardware down to the bit level. There are two purposes for this. First, the performance of the design could be evaluated precisely in order to reach a satisfactory tradeoff between accuracy and efficiency. Second, the intermediate variables of the simulation could be used to verify the hardware during debugging. The Yosemite sequence with ground truth was used to evaluate system performance.

1) Gradient calculation: The gradient calculation is the first step in the computation, and errors in it propagate to subsequent modules. Therefore, a derivative operator with a large radius is preferred. If we denote the radius of the derivative operator as r, then 2r+1 frames of data must be read from off-chip memory into the FPGA to calculate the temporal derivative for each frame, because there is not enough on-chip memory in the FPGA to store these frames. If a larger derivative operator is used, the memory bandwidth required to retrieve the needed 2r+1 frames goes up as well. After comparison, a series-designed first-order derivative operator of radius 2 was chosen [17]:

$$D = \frac{1}{12}\,\begin{pmatrix} 1 & -8 & 0 & 8 & -1 \end{pmatrix} \qquad (7)$$
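Applied along the temporal axis, the operator of Equation (7) consumes the 2r+1 = 5 frame window discussed above. A software sketch (the frame buffering and fixed-point scaling of the real hardware are omitted):

```python
import numpy as np

# Eq. (7): radius-2 first-order derivative mask, applied as a weighted sum.
D = np.array([1.0, -8.0, 0.0, 8.0, -1.0]) / 12.0

def temporal_derivative(frames):
    """Temporal derivative at the middle frame of a 5-frame window.

    frames: array of shape (5, H, W) -- the five frames that must be
    streamed in from off-chip memory for each output frame.
    """
    return np.tensordot(D, frames, axes=(0, 0))
```

For a brightness ramp that increases by one gray level per frame, the weighted sum evaluates to exactly 1 pixel-intensity unit per frame, as a first-order derivative should.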

Therefore, to calculate the derivative for one frame, five frames must be read from memory.

2) Gradient weighting and tensor weighting: As shown in Figure 1, there are two weighting processes in the computation. The weights w_i and c_i are both given by the Gaussian function; the kernel is n1×n1 for gradient weighting and n2×n2 for tensor calculation. These weighting processes are essential to suppress noise

© 2007 ACADEMY PUBLISHER

in the image. These two processes are correlated, and so they were analyzed together. Table I shows different combinations of the kernel sizes and the best accuracy provided by each combination. To obtain the results shown in the table, the algorithm was tested on the Yosemite sequence and the resulting accuracy (measured as angular error) was compared against ground truth.

TABLE I. CONFIGURATIONS OF WEIGHTING PARAMETERS AND ACCURACY

  n1 | n2 | σ1  | σ2  | Best accuracy
  ---+----+-----+-----+--------------
   3 |  5 | 3.0 | 3.0 | 14.8°
   5 |  3 | 1.9 | 2.1 | 15.8°
   7 |  3 | 2.1 | 2.5 | 12.7°
   9 |  3 | 2.4 | 2.9 | 11.5°
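The Gaussian weighting processes above are separable: an n×n pass can be realized as two cascaded 1-D convolutions. A sketch of this decomposition (the n and σ values used here are illustrative):

```python
import numpy as np

def gauss1d(n, sigma):
    # Normalized 1-D Gaussian kernel of length n.
    x = np.arange(n) - n // 2
    k = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    return k / k.sum()

def gauss2d_separable(img, n, sigma):
    """n x n Gaussian smoothing as two cascaded 1-D convolutions.

    A full 2-D kernel costs n*n multiplies per pixel; the separable
    form costs 2*n, which is what makes it attractive in hardware.
    """
    k = gauss1d(n, sigma)
    rows = np.apply_along_axis(np.convolve, 1, img, k, mode='same')
    return np.apply_along_axis(np.convolve, 0, rows, k, mode='same')
```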

Because the weights are given by the Gaussian function, a 2D weighting process can be efficiently decomposed into two cascaded 1D convolutions to save hardware resources. Once the kernel sizes were fixed, different sets of standard deviations σ1 and σ2 were tried to find the best accuracy; both were increased from 0.3 to 3 in increments of 0.1. Ultimately, n1=7 and n2=3 were chosen as optimal.

3) Hardware optimization: For maximum hardware performance, several optimizations were made:

a) Pipeline structure. A heavily pipelined hardware structure was used to maximize throughput. Once the pipeline is full, the hardware produces a result on every clock cycle; at a 100 MHz clock rate, the pipeline can compute around 100 million pixels per second (325.5 frames per second at 640×480 resolution).

b) Fast memory access. Due to the pipelined hardware architecture used, the major system bottleneck turned out to be memory access. To overcome this, a specialized memory interface was employed; this is discussed in more detail in the next section.

c) Bit width trimming. Fixed-point numbers were used in the design. To maintain accuracy yet save


Figure 2 Bit width propagation

hardware resources, the bit widths used were custom-selected at each stage of the processing pipeline. Figure 2 is an expanded version of Figure 1 and shows the bit widths used. Data are trimmed twice to save hardware resources: once after the Gradient Weighting module and once after the Tensor Calculation module. It can be observed from Equations (5) and (6) that a scaling factor applied to the variables t1–t6 does not change the final results, so bit width trimming does not affect the ratios computed in these two equations.

d) LUT-based divider. A normal pipelined 32×32-bit hardware divider would consume too many hardware resources. Therefore, two lookup-table-based dividers were used in the Optical Flow Calculation module to decrease latency and save hardware resources. These were used without any adverse impact on result accuracy.

IV. HARDWARE PLATFORM

This design was implemented on the Xilinx XUP V2P board [18]. As shown in Figure 3, the on-board FPGA is a Virtex-II Pro XC2VP30, which has 13,696 slices. The board also contains 256 MB of off-chip DDR memory.
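The lookup-table divider described in Section III can be modeled in software as a reciprocal table plus one multiply. The table size and normalization scheme below are our illustrative choices; the actual hardware uses a 17-bit table, and its internal format is not specified in the paper:

```python
LUT_BITS = 10   # illustrative; the implementation uses a 17-bit table

# Reciprocal table indexed by the top LUT_BITS bits of the divisor.
RECIP = [0] + [round((1 << 16) / d) for d in range(1, 1 << LUT_BITS)]

def lut_divide(num, den):
    """Approximate integer num/den with one table read and one multiply.

    The divisor is right-shifted so that its leading bits index the
    table; the quotient is then rescaled by the same shift.
    """
    if den == 0:
        return 0
    shift = max(den.bit_length() - LUT_BITS, 0)
    idx = den >> shift              # top LUT_BITS bits of den
    return (num * RECIP[idx]) >> (16 + shift)
```

Because the shift only discards the divisor's low-order bits, the approximation error shrinks as the table widens, which is why a 17-bit table was accurate enough in practice.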

In the system diagram shown in Figure 3, the FPGA communicates with a host PC through both Ethernet and serial ports. Ethernet is used for high-speed data transfers, specifically the uploading of image sequences to hardware and the downloading of results. The UART is used for control and for outputting debugging information through the serial port. The design was implemented using the Xilinx EDK tools. The most common structure of an EDK-based design is for the DDR memory controller to connect to the PLB bus like any other peripheral. An initial analysis of the memory bandwidth required for the computation (multiple frames must be fetched from memory for each computed result frame) showed that the PLB bus would likely become the performance-limiting component in the design. As a result, a multi-port memory controller (MPMC) architecture was chosen instead. The MPMC used was provided by Xilinx and is a dedicated DDR memory controller that provides four separate memory ports to the rest of the design. As can be seen in Figure 3, one of those ports is connected to the PLB bus to provide PLB-based memory access. However, the optical flow core hardware (the pipeline of Figure 2) is directly connected to two of the

Figure 3 System diagram


Figure 4 BYU’s Helios board

remaining memory ports. This provides at least two benefits. First, memory accesses bypass the PLB bus, leaving it free for other communications within the system. Second, interfacing with the MPMC is much simpler than interfacing with the PLB bus, resulting in a smaller and higher-performance optical flow core implementation. A final advantage of this architecture is that less buffering is required for the optical flow core to communicate with memory.

This design accepts 640×480 16-bit YUV images, which are read into memory through an Ethernet interface. They are then read from memory into the FPGA through the multi-port memory controller for processing. Prior to processing, the U and V components are removed and only the Y component is processed.

The XUP V2P board is approximately 50 square inches in size and is thus unsuitable for use on small ground or air vehicles. The Helios Robotic Vision platform [19] was developed at Brigham Young University to accommodate real-time computing on small autonomous vehicles. The Helios board, shown in Figure 4, is compatible with Xilinx Virtex-4 FX series FPGAs up to the FX60. The Virtex-4 FX60 has 25,280 slices, about twice as many as the Virtex-II Pro XC2VP30 on the XUP V2P board. Helios has almost all of the XUP board's features but is only the size of a business card and consumes only 2 watts of power. The next goal of our research is to transfer our design to the Helios board for use on small unmanned ground and air vehicles.

V. EXPERIMENTS

The system clock rate for our XUP board implementation is 100 MHz. It can process 640×480 frames at a rate of 64 frames per second. This processing speed is equivalent to about 258 fps at an image size of 320×240. As mentioned above, all three YUV components are read into memory but only Y is processed. If only the Y component were read into the FPGA, the system speed could be further increased (nearly doubled) by reducing the memory access bottleneck.
Our optical flow algorithm uses 10,288 slices (75% of the total slices available).
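The accuracy figures reported below use the standard angular-error measure: the angle between the spatio-temporal direction vectors (vx, vy, 1) of the estimated and ground-truth flow. A sketch of the metric:

```python
import numpy as np

def angular_error_deg(u, v, u_gt, v_gt):
    """Mean angular error in degrees between estimated flow (u, v)
    and ground truth (u_gt, v_gt), measured between the 3-D vectors
    (u, v, 1) and (u_gt, v_gt, 1)."""
    num = u * u_gt + v * v_gt + 1.0
    den = np.sqrt((u ** 2 + v ** 2 + 1.0) * (u_gt ** 2 + v_gt ** 2 + 1.0))
    cos = np.clip(num / den, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())
```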

© 2007 ACADEMY PUBLISHER

A. Synthetic Sequence

This design was tested on the Yosemite sequence and the Flower Garden sequence using the same hardware. The image sizes of these two sequences are smaller than 640×480 and so were padded with zeros before processing; however, the results are shown in the figures below at their original sizes.

Figure 5(a) shows one frame of the Yosemite sequence and Figure 5(b) is the result from the hardware. The optical flow field in the sky region is noisy due to the brightness change across frames. The optical flow in other regions is less noisy, except in the lower left part, because of the absence of texture. The average angular error is 12.9° and the standard deviation is 17.6°. The sky region was excluded from the error analysis, as in most other work found in the literature. In simulation, the division operations in the final step (Equations (5) and (6)) are performed in double precision; in the hardware implementation, we use a 17-bit lookup table to implement them. This accounts for the minor differences between the simulation and hardware results.

To the best of our knowledge, the only error analysis of optical flow implemented in an FPGA is given in [4], which also tested its design on the Yosemite sequence. Their average angular error was 18.3° and their standard deviation was 15.8°. Their design processed 320×240 images at 30 fps and used approximately 19,000 slices, almost twice as many as our design.

Figure 6 shows our results for the Flower Garden sequence. Optical flow values near the trunk boundary are quite noisy because the velocity within the neighborhood is not constant, which violates the assumption behind Equation (3). The optical flow inside a uniform motion region is more accurate than that along a motion boundary.

B. Real Sequence

Figure 7 shows our system's results on a real image sequence, with the motion field plotted on top of the original image.
The image sequence was taken using a camera placed on top of a small toy truck. Between images, the truck was manually moved a small distance toward the jig on the ground to mimic real truck motion. The optical flow field for this sequence is much noisier than for the synthetic sequences,

JOURNAL OF MULTIMEDIA, VOL. 2, NO. 5, SEPTEMBER 2007

43


Figure 5 The Yosemite sequence and the measured optical flow field (a) 8th frame of Yosemite sequence (b) optical flow field


Figure 6 The Flower Garden sequence and the measured optical flow field (a) 10th frame of Flower Garden sequence (b) optical flow field

especially in the far end background. The main reasons for this are as follows: a) The distances moved between frames were controlled by hand, so movement constancy between frames cannot be guaranteed. b) The time interval between frames was slightly too long, so lighting changes between frames were larger than the algorithm can tolerate. c) The camera used is a CMOS line-exposure imager, which introduces extra noise and artifacts into the images. d) There is little texture in the background regions.

We believe this image sequence represents an extreme example of the types of sequences which may be encountered in a real-time machine vision environment, and it therefore has value in demonstrating the shortcomings of optical flow algorithms. We argue, however, that the performance here is mainly restricted by the limited hardware resources. To overcome this, stronger regularization techniques should be used to suppress the noise and smooth the motion field. The following improvements are thus proposed: a) Increase the weighting mask size. From Table I, it can be seen that larger weighting window sizes produce better accuracy. b) Incorporate temporal smoothing. Temporal smoothing will substantially suppress the noise in the optical flow vectors; however, it can be costly to implement in hardware.

c) Add more weighting processes. One weighting process can be added after module 5, as shown in Figure 1. d) Apply biased least squares techniques. When the tensor matrix is ill-conditioned, the computed optical flow can be severely distorted, and if a weighting scheme is applied afterwards this result will "pollute" the neighboring vectors. Biased least squares can therefore be applied to solve this problem at the cost of a small bias in the result. Details on this approach can be found in [20].

These improvements were incorporated into our Matlab model and simulated. Figure 8 shows the improved results. It can be observed that the generated optical flow field is much smoother and more accurate. The key point here is not to show the effectiveness of the new algorithm but to illustrate that, with more hardware resources, the proposed design can be improved significantly.

Figure 7 Initial result on real image sequence
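The biased least squares revision in d) can be read as ridge regularization of the 2×2 tensor system behind Equations (5) and (6): a small λ is added to the diagonal before solving. The λ value below is illustrative, not taken from the paper:

```python
def flow_biased(t1, t2, t4, t5, t6, lam=1e-3):
    """Biased (ridge) solution of the tensor system of Eqs. (5)-(6).

    Adds lam to the diagonal of [[t1, t4], [t4, t2]], trading a small
    bias for stability when the tensor is ill-conditioned, so a
    distorted vector does not pollute later smoothing stages.
    """
    a, b = t1 + lam, t2 + lam
    det = a * b - t4 ** 2    # strictly positive for PSD T and lam > 0
    vx = (t6 * t4 - t5 * b) / det
    vy = (t5 * t4 - t6 * a) / det
    return vx, vy
```

With lam = 0 this reduces exactly to Equations (5) and (6); the statistical treatment of the bias is given in [20].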

The details of this revision are not given here, but with the abundant hardware resources of the Helios platform, implementation of the improved algorithm is possible. More work will be required to analyze these revisions, find a tradeoff between performance and hardware implementation complexity, and achieve a feasible real-time result. This is the next goal of our research.

Figure 8 Improved result

VI. CONCLUSIONS

High computational demand makes it difficult to use optical flow algorithms for real-time applications on general purpose processors. In this paper, a hardware design for a tensor-based optical flow algorithm has been developed and implemented in an FPGA. The new design can calculate optical flow at a speed of 19,661 K pixels per second, which is equivalent to 64 frames per second at 640×480 resolution. The accuracy on the Yosemite sequence is shown to be remarkably better than the previous work found in the literature [4], while using approximately half the hardware resources. Another advantage of a high-speed optical flow implementation is that, through higher frame sample rates, performance can be improved by better meeting the brightness constancy assumption.

Before hardware implementation, the expected design performance in terms of accuracy and resource utilization was carefully evaluated in software. The tradeoff study was conducted by tuning several important parameters such as the kernel sizes of the weighting processes and the derivative operator. Software intermediate results were also used for hardware verification. The algorithm was tested on both synthetic and real image sequences. The proposed design works well on synthetic sequences but is not satisfactory on the real sequence. Some revisions that require more hardware resources are proposed to improve the system design. With our new Helios platform, it is our next research goal to implement the improved algorithm in a larger FPGA.

ACKNOWLEDGMENT

This work was supported in part by David and Deborah Huber.

REFERENCES

[1] M. Correia, A. Campilho, "Real-time implementation of an optical flow algorithm", Proc. ICIP, vol. 4, pp. 247-250, 2002.
[2] A. Zuloaga, J. L. Martín, J. Ezquerra, "Hardware architecture for optical flow estimation in real time", Proc. ICIP, vol. 3, pp. 972-976, 1998.
[3] J. L. Martín, A. Zuloaga, C. Cuadrado, J. Lázaro, U. Bidarte, "Hardware implementation of optical flow constraint equation using FPGAs", Computer Vision and Image Understanding, vol. 98, pp. 462-490, 2005.
[4] J. Díaz, E. Ros, F. Pelayo, E. M. Ortigosa, S. Mota, "FPGA-based real-time optical-flow system", IEEE Trans. Circuits and Systems for Video Technology, vol. 16, no. 2, pp. 274-279, Feb. 2006.
[5] P. C. Arribas, F. M. H. Maciá, "FPGA implementation of Camus correlation optical flow algorithm for real time images", Proc. 14th Int. Conf. Vision Interface, pp. 32-38, 2001.
[6] H. Niitsuma, T. Maruyama, "High speed computation of the optical flow", Lecture Notes in Computer Science, vol. 3617, pp. 287-295, 2005.
[7] B. Horn, B. Schunck, "Determining optical flow", Artificial Intelligence, vol. 17, pp. 185-203, 1981.
[8] B. D. Lucas, T. Kanade, "An iterative image registration technique with an application to stereo vision", Proc. DARPA Image Understanding Workshop, pp. 121-130, 1984.
[9] G. Farnebäck, "Very high accuracy velocity estimation using orientation tensors, parametric motion, and simultaneous segmentation of the motion field", Proc. ICCV, vol. 1, pp. 77-80, 2001.
[10] G. Farnebäck, "Fast and accurate motion estimation using orientation tensors and parametric motion models", Proc. ICPR, vol. 1, pp. 135-139, 2000.
[11] B. Jähne, H. Haussecker, H. Scharr, H. Spies, D. Schmundt, U. Schur, "Study of dynamical processes with tensor-based spatiotemporal image processing techniques", Proc. ECCV, vol. 2, pp. 322-336, Jan. 2000.
[12] G. Farnebäck, "Orientation estimation based on weighted projection onto quadratic polynomials", Proc. Vision, Modeling, and Visualization 2000, pp. 89-96, Nov. 2000.
[13] L. Haiying, C. Rama, R. Azriel, "Accurate dense optical flow estimation using adaptive structure tensors and a parametric model", IEEE Trans. Image Processing, vol. 12, pp. 1170-1180, Oct. 2003.
[14] H. Wang, K. Ma, "Structure tensor-based motion field classification and optical flow estimation", Information, Communications and Signal Processing 2003, vol. 1, pp. 66-70, Dec. 2003.
[15] B. Johansson, G. Farnebäck, "A theoretical comparison of different orientation tensors", Proc. SSAB02 Symposium on Image Analysis, pp. 69-73, Mar. 2002.
[16] H. Haussecker, H. Spies, "Handbook of Computer Vision and Application", vol. 2, ch. 13, Academic Press, New York, 1999.


[17] B. Jähne, "Spatio-Temporal Image Processing: Theory and Scientific Applications", Springer-Verlag, Berlin, Germany, 1993.
[18] Xilinx, http://www.xilinx.com/univ/xupv2p.html.
[19] http://www.ece.byu.edu/roboticvision/helios/
[20] J. Groß, "Linear Regression", Lecture Notes in Statistics, Springer-Verlag, Berlin, Germany, 2003.
[21] Z. Wei, D. J. Lee, B. Nelson, M. Martineau, "A fast and accurate tensor-based optical flow algorithm implemented in FPGA", Proc. IEEE WACV '07, p. 18, 2007.

Zhaoyi Wei is a Ph.D. candidate in the Department of Electrical and Computer Engineering at Brigham Young University, Provo, Utah, USA. His research interests include real-time motion estimation and segmentation algorithms and their optimization in hardware, as well as applications of motion estimation and segmentation algorithms. He received his M.E. and B.E. with honors from Northeastern University, Shenyang, Liaoning, China. He is a student member of IEEE.

Dah-Jye Lee received his B.S. from National Taiwan University of Science and Technology in 1984, M.S. and Ph.D. degrees in electrical engineering from Texas Tech University in 1987 and 1990, respectively. He also received his MBA degree from Shenandoah University, Winchester, Virginia in 1999. He is currently an Associate Professor in the Department of Electrical and Computer Engineering at Brigham Young


University. He worked in the machine vision industry for eleven years prior to joining BYU in 2001. His research focuses on medical informatics and imaging, shape-based pattern recognition, hardware implementation of real-time 3-D vision algorithms, and machine vision applications. Dr. Lee is a senior member of IEEE and a member of SPIE. He has actively served as a paper and proposal reviewer and conference organizer, and has served as editor, general chair, and steering committee member of the IEEE International Symposium on Computer-Based Medical Systems. He received the best faculty advisor award from the Brigham Young University Student Association in 2005.

Brent E. Nelson received his B.S., M.S., and Ph.D. degrees in 1981, 1983, and 1984, respectively, in Computer Science from the University of Utah. He is currently a Professor in the Department of Electrical and Computer Engineering at Brigham Young University. His research focuses on FPGA-based reconfigurable computing, specifically FPGA-based real-time signal and image processing and CAD tools for reconfigurable computing. Dr. Nelson is a senior member of the IEEE and actively serves on the program committees of a number of reconfigurable computing conferences. He served as department chair of Electrical and Computer Engineering at Brigham Young University from 1993 to 1997 and currently serves as Computer Engineering program head in the department.
