
Viewpoint-Independent Object Detection Based on Two-Dimensional Contours and Three-Dimensional Sizes

Ping-Han Lee, Yen-Liang Lin, Shen-Chi Chen, Chia-Hsiang Wu, Cheng-Chih Tsai, and Yi-Ping Hung

Abstract—We propose a viewpoint-independent object-detection algorithm that detects objects in videos based on their 2-D and 3-D information. Object-specific quasi-3-D templates are proposed and applied to match objects' 2-D contours and to calculate their 3-D sizes. A quasi-3-D template is the contour and the 3-D bounding cube of an object viewed from a certain panning and tilting angle. A total of 2660 pedestrian templates and 1995 vehicle templates, encompassing 19 tilting and 35 panning angles, are used in this study. To detect objects, we first match the 2-D contours of object candidates with known objects' contours, and object templates with large 2-D contour-matching scores are identified. In this step, we exploit prior knowledge on the viewpoint from which the object is viewed to speed up the template matching, and a viewpoint likelihood is also assigned to each contour-matched template. Then, we calculate the 3-D widths, heights, and lengths of the contour-matched candidates, as well as the corresponding 3-D-size-matching scores. The overall matching score is obtained by combining the aforementioned likelihood and scores. The major contribution of this paper is to explore the joint use of 2-D and 3-D features in object detection. It shows that, by considering 2-D contours and 3-D sizes, one can achieve promising object detection rates. The proposed algorithm was evaluated on both pedestrian and vehicle sequences. It yielded significantly better detection results than the best results reported in PETS 2009, showing that our algorithm outperforms the state-of-the-art pedestrian-detection algorithms.

Index Terms—Object detection, pedestrian detection, vehicle detection.

Manuscript received March 30, 2010; revised December 5, 2010 and June 5, 2011; accepted August 8, 2011. Date of publication September 22, 2011; date of current version December 5, 2011. This work was supported by the National Science Council, Taiwan, under Grant NSC 98-2221-E-002-127MY3. The Associate Editor for this paper was P. Grisleri. P.-H. Lee is with MediaTek Inc., Hsinchu 30078, Taiwan. Y.-L. Lin and Y.-P. Hung are with the Graduate Institute of Networking and Multimedia, National Taiwan University, Taipei 10617, Taiwan. S.-C. Chen, C.-H. Wu, and C.-C. Tsai are with the Department of Computer Science and Information Engineering, National Taiwan University, Taipei 106, Taiwan. Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TITS.2011.2166260

I. INTRODUCTION

Detecting objects such as vehicles or pedestrians plays an essential role in intelligent transportation systems. Depending on the scenario, different computer vision algorithms have been developed; some algorithms focused on automotive cameras to assist drivers [1]–[3], whereas others worked with stationary cameras on roads to monitor the traffic [4]–[8]. In [9], a good survey on pedestrian detection is given. Detecting vehicles
or pedestrians using stationary monocular cameras on roads has many applications since, nowadays, the density of cameras on urban roads is quite high. Algorithms can take advantage of this scenario to achieve better efficiency and accuracy. First, since the camera is stationary in this scenario, we can estimate the camera parameters and explicitly calculate the 3-D sizes of objects. These 3-D sizes can be used as an additional cue, aside from the information from 2-D intensities. Second, since we are interested in moving vehicles or pedestrians in such a scenario, we can use foreground detection algorithms to detect the moving foreground and detect vehicles or pedestrians only in the foreground regions. This strategy speeds up the detection and, at the same time, lowers the false detections in the background regions. This paper focuses on such a scenario, and we propose an algorithm that detects moving objects using both 2-D and 3-D features.

Several 2-D features (i.e., features derived from image intensities) have been proposed for vehicle and pedestrian detection, e.g., Haar-like features [10]–[12] and histograms of oriented gradients [13], [14]. The contours of objects are more robust to changes in appearance owing to different illumination conditions or colors of vehicles or pedestrians. Gavrila and Philomin proposed to detect objects based on their contours [15]. They collected a set of contour templates, each one being a binary map in which an "on" pixel denotes the presence of a feature and an "off" pixel denotes its absence. To detect objects in an image, they found edges in this image, and the chamfer distances [16] between the contour templates and the edge image were calculated. The work in [15] was extended in [17] by adding an additional verification stage, which verified objects based on their image intensities. A recent work of Gavrila and Munder [18] constructed a multicue stereo-based pedestrian-detection and tracking system that employed a contour-based detection and texture-based classification scheme. Aside from the chamfer distance, Felzenszwalb [19] applied the Hausdorff distance as a contour comparison measure for object detection. The contour templates collected in most of the aforementioned works seem to be limited in terms of viewpoints, and it is not clear how these algorithms would perform in detecting objects under arbitrary viewpoints.

Aside from 2-D features, some works used 3-D features in object detection. The 3-D features of objects are their sizes or locations in 3-D space, and these features can be useful in object detection. Miao et al. [20] reduced about 30% of the false pedestrian detections given by a typical object-detection algorithm based
on the 3-D object heights, widths, and distances to the camera center. Hoiem et al. [21] and Ess et al. [22] used the scene depth as a clue in object detection and showed that the estimated scene depth and the inferred possible object locations can be mutually improved. Three-dimensional object sizes or locations can be calculated if we assume that objects are positioned on the ground and the camera parameters are known. Kanhere et al. [7] estimated camera parameters and tracked vehicles, each of which was defined as a cluster of 3-D feature points. Our previous work [23] also estimated the camera parameters and modeled objects as 3-D billboards. Aside from monocular cameras, Corneliu and Nedevschi [24] used a calibrated stereo acquisition system to extract 2-D and 3-D information for object detection.

This paper considers a typical surveillance application where three conditions hold.

1) A monocular camera is used, and the camera is fixed.
2) Consecutive video frames are inputs to the algorithms.
3) The goal is to detect the moving pedestrians, vehicles, or other objects.

We propose a generic object-detection algorithm based on both the 2-D contours and the 3-D sizes of objects. To match objects' 2-D contours and to calculate their 3-D sizes, object-specific quasi-3-D templates are proposed. A quasi-3-D template is the contour and the 3-D bounding cube of an object viewed from a certain panning and tilting angle, and it can be generated using the corresponding 3-D computer-aided design (CAD) model. The idea of combining many view-specific detectors into a view-independent detection system has long been studied in the literature [25]. Typically, such methods require training samples taken from different views. With the aid of 3-D CAD models, the proposed algorithm can generate training samples to build a view-specific template. Furthermore, there exist works that created 2-D image patches using a 3-D CAD model [26]–[28]. However, these algorithms extract 2-D features only, whereas the proposed algorithm extracts both 2-D contours and 3-D sizes of objects. Fig. 1 shows some pedestrian templates.

Compared with existing object-detection works, the distinctive features of the proposed algorithm are summarized here.

1) Most object-detection works use 2-D features, and a few consider 3-D information. The proposed algorithm uses both 2-D and 3-D features.
2) Unlike existing works using chamfer matching [15], [17], [18], [29] that generate contour templates from a limited set of training data, the proposed algorithm generates quasi-3-D object contour templates encompassing all viewpoints using 3-D CAD models.
3) These quasi-3-D object templates also enable the calculation of the 3-D widths, heights, and lengths of objects. Compared with [23], the 3-D sizes extracted in this work are more robust to variations in viewpoints.

The proposed algorithm used a planar ground model (i.e., pedestrians on the ground) with known relative camera position and/or orientation. The planar ground model was adopted in several works [18], [30]. Gavrila and Munder [18] multiplexed the estimated disparity map into N discrete depth ranges.

Fig. 1. Quasi-3-D pedestrian templates. At the center of this figure is a 3-D mesh of a walking pedestrian. Each template is the projection of the contour (white curvy lines) and the 3-D enclosing cube (red straight lines) of a pedestrian from 3-D space to 2-D image plane under a certain viewpoint. The upper ten templates have the same tilting angle (50◦ ) but different panning angles 0◦ , 20◦ , . . . , 180◦ ; the lower ten templates have the same tilting angle (0◦ ) but different panning angles 180◦ , 200◦ , . . . , 360◦ (or 0◦ , equivalently.)

To identify the pedestrian region of interest (ROI), they scanned the images of each depth range with windows related to the minimum and maximum extents of pedestrians, taking into account the ground plane location at a particular depth range. Bertozzi et al. [30] established the relationship between the image coordinates of the pedestrian bounding box and the world coordinates of the corresponding bounding box using the planar ground model and known camera parameters. A Kalman filter was applied to estimate the pedestrian position given this relation. The proposed algorithm used the planar ground model in a different way from both works. Instead of finding the pedestrian ROI or estimating pedestrian positions, we estimated the 3-D object width, height, and length and used the resulting 3-D features for subsequent object verification.

The rest of this paper is organized as follows. Section II gives an overview of the proposed system. Section III describes the proposed algorithm, including the quasi-3-D templates (Section III-A), the 2-D contour-matching scheme (Section III-B), the viewpoint likelihoods (Section III-C), the 3-D size calculation method (Section III-D), and the decision fusion scheme (Section III-E). The experimental results are given in Section IV, followed by conclusions and future work in Section V.

II. SYSTEM OVERVIEW

Fig. 2 summarizes the proposed algorithm. The algorithm involves the following steps:

1) Moving Blob Detection: To detect an object of a specific class (pedestrian or vehicle), the proposed algorithm first detected moving blobs in the given video frame. We applied the codebook method [31] to detect moving foregrounds. The codebook algorithm adopted a quantization/clustering technique to construct a background model from a video sequence. It captured periodic motion, handled illumination variations, and was efficient in memory and speed.


Fig. 2. Overview of the proposed system. Given a video frame, we first detect moving foreground. Each connected component in the foreground is defined as a moving blob. To verify whether each moving blob belongs to a specific object class, first, its DT image is calculated and matched against known object contours, and the 2-D contour-matching score is calculated. Using the observation that the same object in consecutive frames should have close panning and tilting angles viewed by the camera, we calculate the viewpoint likelihood. Given camera parameters, we calculate the object’s 3-D width, height, and length and its 3-D matching score. The overall verification decision depends on both 2-D and 3-D information.

The technique that detected shadows in the normalized color components of the YUV color space [32] was also applied to remove shadows in the foregrounds. Morphological operations were performed on the shadow-free foregrounds to remove noise, and connected components were found and defined as the moving blobs.

2) 2-D Contour Matching: To verify whether an object of a certain class exists in a moving blob (take the pedestrian for example), we matched the 2-D contour of the moving blob with known pedestrians' contours. In this step, we exhaustively searched objects of several possible sizes and locations in a moving blob. For each location, the 2-D contour-matching score (denoted as α) based on the Chamfer distance was calculated. A location (i.e., a rectangle on the image) with a large 2-D contour-matching score was identified as an object candidate.

3) Viewpoint-Likelihood Calculation: For each candidate, based on the observation that the same object in consecutive frames should have similar panning and tilting angles, we calculated the viewpoint likelihood (denoted as β) and combined it with the 2-D contour-matching score to penalize candidates with a high 2-D contour-matching score but unreasonable changes in panning or tilting angles across consecutive frames.

4) 3-D Size Matching: For those candidates that achieved high 2-D contour-matching scores and viewpoint likelihoods, we estimated their 3-D length, width, and height. Then, these 3-D sizes were matched against known objects' 3-D sizes, and the 3-D size-matching scores, denoted as γ, were calculated.

5) Decision Fusion: Given the 2-D contour-matching, viewpoint-likelihood, and 3-D size-matching scores, we applied the mixture-of-experts model (MEM) [33] to obtain the overall object detection results.

We encourage the reader to watch the supplementary demo video for a quick overview of this work.

III. OBJECT DETECTION USING QUASI-3-D TEMPLATES

A. Quasi-3-D Templates

We propose quasi-3-D templates for object detection. The quasi-3-D templates can be used to perform 2-D contour matching and 3-D size calculation. A quasi-3-D template of an object class is the projection of the contour and the 3-D enclosing cube of that object from 3-D space to the 2-D image plane under a certain viewpoint.

It is defined as Ω_{p,t} = {X, p₁, p₂, p₃, p₄}, where {p, t} are the panning and tilting angles from which the object is viewed, X is a binary map in which pixels with value "1" indicate the existence of the contour, and p₁ through p₄ are four vertices of the 3-D enclosing cube. It is quasi-3-D because it contains some 3-D information but is not real 3-D data (i.e., 3-D point clouds). To handle the intraclass variation, each object class is divided into several object subclasses. For example, the subclasses of pedestrians could be "standing still" or "walking with arms raised;" the subclasses of vehicles could be "sedan," "station wagon," "truck," or "bus." A complete set of quasi-3-D templates for an object class can be written as O = {Ω^{(i)}_{p,t} ∀{p, t}, i = 1 ∼ K}, where i is the index of the subclass and K is the total number of subclasses that this object class has. In practice, a set of discrete, evenly sampled {p, t} is used for each object class.

In this work, we defined four pedestrian subclasses: two were men, and two were women. We also defined three vehicle subclasses, which were sedan, wagon, and hatchback. For each object subclass, we prepared the corresponding 3-D meshes and used Autodesk Maya, a 3-D computer graphics software package, to render the contours and the 3-D enclosing cubes of the meshes under combinations of 35 panning angles ranging from 0° to 350° and 19 tilting angles ranging from 0° to 90°, resulting in 665 templates for each object subclass. This resulted in 2660 pedestrian templates and 1995 vehicle templates. Fig. 3 shows some pedestrian and vehicle templates. The sets of pedestrian and vehicle templates used in this work are denoted as P = {Ω^{(i)}_{p,t}, p ∈ P, t ∈ T, i = 1 ∼ 4} and V = {Ω^{(i)}_{p,t}, p ∈ P, t ∈ T, i = 1 ∼ 3}, respectively, where P = {0°, 10°, 20°, 30°, . . . , 350°} and T = {0°, 5°, 10°, 15°, . . . , 90°}.
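The template organization described above lends itself to a simple lookup structure. The following Python sketch shows one way the quasi-3-D templates and the sets P and V might be represented; the class and function names are illustrative and not part of the paper.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Quasi3DTemplate:
    """One quasi-3-D template Omega_{p,t}: a binary contour map X plus the
    four projected vertices p1..p4 of the 3-D enclosing cube (Section III-A)."""
    contour: np.ndarray        # binary map X; 1 marks a contour pixel
    cube_vertices: np.ndarray  # shape (4, 2): image coordinates of p1..p4
    pan: int                   # panning angle p in degrees: 0, 10, ..., 350
    tilt: int                  # tilting angle t in degrees: 0, 5, ..., 90
    subclass: int              # subclass index i (e.g., 1..4 for pedestrians)

def index_by_viewpoint(templates):
    """Group templates by (pan, tilt) so that the viewpoint-constrained
    subset used in Section III-C can be retrieved quickly."""
    index = {}
    for tpl in templates:
        index.setdefault((tpl.pan, tpl.tilt), []).append(tpl)
    return index

# Example: the pedestrian set P holds 4 subclasses x 35 pans x 19 tilts = 2660 templates.
```

Grouping by viewpoint keeps the later tilting- and panning-angle constraints to a dictionary lookup rather than a scan over all 2660 templates.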

B. 2-D Contour Matching

Assume that we have found a moving blob, shown as the red rectangle in Fig. 4(a) and (b). A moving blob may contain a single object or multiple objects, and these objects may partially occlude each other. To detect these objects, we match the contours of quasi-3-D templates of several different sizes at many possible locations in the moving blob. A subregion of a moving blob that returns a small distance is inferred to be an object candidate. To calculate the distance between the moving blob's contour and the contour of a quasi-3-D template, we apply the Chamfer distance. The Chamfer distance was proposed by Barrow et al. [34] and later refined by Borgefors [16]. Gavrila [17] applied this technique to detect pedestrians in videos.


Fig. 3. Examples of pedestrian and vehicle templates. (a) Thirty-five pedestrian templates with t = 0◦ and different p’s. (b) Thirty-five pedestrian templates with t = 50◦ and different p’s. (c) Thirty-five templates with t = 30◦ and different p’s.

Fig. 4. Matching objects' contours using quasi-3-D templates. Given a video frame shown in (a), we first find the foregrounds and a moving blob, as shown in (b). By combining the detailed edges [see (c)] and the contour band [see (d)], we obtain the contour edges [see (e)]. We slide template windows of different sizes and locations over the DT image of (e) [see (f)] and find two contour-matched image subregions [see (g)]. We further verify the two subregions using their 3-D sizes and detect two partially occluded pedestrians, as shown in (h). (a) Input frame and I. (b) Moving blob and Ĩ. (c) I_E. (d) Ĩ_E. (e) I′_E. (f) I_DT and an I′_DT. (g) Matched templates. (h) Results.

The calculation of the Chamfer distance between a quasi-3-D template and a moving blob is summarized here.

1) Assume that we have a video frame and the corresponding foregrounds, as shown in Fig. 4(a) and (b), respectively. The red rectangle indicates a moving blob. We denote the image patches enclosed by this moving blob in Fig. 4(a) and (b) as I and Ĩ, respectively.

2) Find edges in I, and denote the result as I_E, as shown in Fig. 4(c).

3) Find edges in Ĩ, and calculate its distance-transformed (DT) image. The DT image gives the distance of each pixel to the nearest edge pixel [34]. By binarizing this DT image, we obtain Ĩ_E, as shown in Fig. 4(d).

4) I_E gives the sharp edges of objects, but it also contains many inner edges, which would cause many false alarms in the subsequent template-matching process. Ĩ_E, on the other hand, contains a rough edge band along object contours. To obtain sharp edges on object contours, we AND I_E with Ĩ_E to obtain I′_E, as shown in Fig. 4(e).

5) Calculate the DT image of I′_E, which gives I_DT, shown in Fig. 4(f). Then, we exhaustively search possible object locations inside I_DT by sliding windows of different sizes over I_DT. We denote each such sliding window as I′_DT.

6) To match the contour of I′_DT to an object class O, all the Ω^{(i)}_{p,t} ∈ O are considered. For each Ω^{(i)}_{p,t} = {X, p₁, p₂, p₃, p₄}, we superimpose its contour (i.e., X) on I′_DT. The distance between X and I′_DT is calculated using

D_chamfer(X, I′_DT) = (1/|X|) · Σ_{t : X(t)=1} I′_DT(t)    (1)

where t is a pixel on the contour, |X| is the number of pixels belonging to the contour in X, and I′_DT(t) is the pixel value at location t in I′_DT. The advantage of matching a quasi-3-D template with the DT image rather than with the edge image is that the resulting similarity measure will be smoother as a function of the template transformation parameters.

7) D_chamfer(X, I′_DT) falls in [0, 1]. We define the 2-D contour-matching score as

α ≡ 1 − D_chamfer(X, I′_DT).    (2)

If α is higher than a predefined threshold, then we infer that there is an object candidate in I′_DT. Fig. 4(g) gives two matched templates. Fig. 4(h) shows the input frame superimposed with the matched quasi-3-D templates.
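To make the matching procedure concrete, the following sketch computes the DT image and the score of Eq. (2) with OpenCV and NumPy. It assumes that the template contour has already been resized to the sliding-window size and that I′_DT has been scaled so that the distances lie in [0, 1]; the paper states that D_chamfer falls in [0, 1] but does not give the exact normalization, so that scaling is an assumption here.

```python
import cv2
import numpy as np

def distance_transform(edge_map):
    """I_DT: distance of every pixel to the nearest edge pixel.
    cv2.distanceTransform measures distance to the nearest zero pixel,
    so edge pixels are first mapped to zero."""
    inverted = np.where(edge_map > 0, 0, 255).astype(np.uint8)
    return cv2.distanceTransform(inverted, cv2.DIST_L2, 3)

def contour_matching_score(template_contour, window_dt):
    """alpha = 1 - D_chamfer(X, I'_DT), Eqs. (1)-(2).
    template_contour: binary map X (same size as the window).
    window_dt: the sub-image I'_DT, assumed scaled to [0, 1]."""
    on_pixels = template_contour > 0
    n_contour = max(int(on_pixels.sum()), 1)       # |X|
    d_chamfer = float(window_dt[on_pixels].sum()) / n_contour
    return 1.0 - d_chamfer
```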


Fig. 5. Calculation of the tilting angle from which the moving blob is viewed. The green lines are equal-tilting contours in the scene. C is the camera lens center, P is the projection point of C on the ground, and G is the middle point on the bottom of the moving blob.

C. Viewpoint Likelihood and the Speedup for Template Matching

The set of pedestrian and vehicle templates P and V used in this work includes 2660 and 1995 templates, respectively. Each object class has templates viewed under 35 panning angles and 19 tilting angles. With some assumptions, we can put constraints on the panning and tilting angles of the templates and speed up the template matching. Specifically, with a calibrated camera, we know the exact tilting angle of an object at any location in a frame. We can use this tilting prior to put a constraint on the tilting angle of the templates. Furthermore, by assuming that an object gradually changes its orientation, we can put a constraint on the maximum difference between the panning angles of two consecutive frames. We also obtain the likelihood of the specific tilting and panning angles from which an object is viewed based on the aforementioned assumptions. We define the viewpoint likelihood as

β ≡ f(Δt) · g(Δp)    (3)

where the first and the second terms are the tilting-angle likelihood and the panning-angle likelihood, respectively. The succeeding sections give details of the two likelihoods.

1) Tilting Angle Likelihood: Consider the moving blob in Fig. 5, which is shown as the red rectangle, and let G be the middle point on its bottom side. Assume that G lies on the ground. Using the camera parameters, we calculate the coordinate of point G in 3-D space. (The calculation details will be given in Section III-D.) The 3-D coordinate of the lens center, denoted as C, is known, and its projection point on the ground, denoted as P, can also be calculated. Given C, P, and G, the tilting angle from which the moving blob is viewed, denoted as θ, can be computed using

θ = cos⁻¹( (→GC · →GP) / (‖→GC‖ ‖→GP‖) ).

As the tilting angle of each quasi-3-D template is known, we should use only the subset of templates whose tilting angles are equal to or close to θ.

We define Δt as (t − θ), t ∈ T. We assume that the error in the estimated θ is less than ±5° and construct the tilting-angle likelihood function f(Δt), which is the N(0, 10²)¹ that has nonzero values only when −5° ≤ Δt ≤ 5°, as shown in Fig. 6(a). This step gives a constraint on the tilting angles for Ω^{(i)}_{p,t} ∈ O, and only the subset Õ = {Ω^{(i)}_{p,t}, p ∈ P, −5° ≤ (t − θ) ≤ 5°, i = 1 ∼ K} will be considered in template matching. Typically, only two tilting angles out of the 19 tilting angles in T will be considered.

2) Panning Angle Likelihood: We assume that a moving object gradually changes its orientation (i.e., panning angle) and that the change rate is less than ±30°/frame.² Assuming that the matched template in the previous frame has panning angle p*, we define Δp as (p − p*), p ∈ P, and construct the panning-angle likelihood function g(Δp), which is the N(0, 20²)³ that has nonzero values only when −30° ≤ Δp ≤ 30°, as shown in Fig. 6(b). This step further constrains the panning angles for Ω^{(i)}_{p,t} ∈ O, and only the subset Õ = {Ω^{(i)}_{p,t}, −30° ≤ (p − p*) ≤ 30°, −5° ≤ (t − θ) ≤ 5°, i = 1 ∼ K} will be considered in template matching. Note that, compared with f(Δt), g(Δp) includes more fuzziness, and it tends to penalize panning angles that differ greatly from the estimated panning angle in the previous frame. Typically, only seven panning angles out of the 35 panning angles in P will be considered.

In summary, using both the tilting- and panning-angle likelihood functions, roughly only 2.1% of all the templates are actually matched for an object class. Meanwhile, since we only consider templates with reasonable tilting and panning angles, we further reduce some false detections.

¹We empirically select σ = 10. However, our algorithm is not sensitive to σ.
²A rate of ±30°/frame gives ±900°/s when the sampling rate of the video is 30 frame/s.
³Again, we empirically select σ = 20. Our algorithm is not sensitive to σ.
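A minimal sketch of the viewpoint likelihood of Eq. (3) is given below, with σ = 10 and the ±5° band for the tilting term and σ = 20 and the ±30° band for the panning term, as described above. The × in the printed θ formula is read here as a dot product between the vectors →GC and →GP, and the wrap-around of panning differences to (−180°, 180°] is an implementation detail not spelled out in the paper.

```python
import numpy as np

def tilt_angle(C, P, G):
    """Tilting angle theta at the blob bottom point G, given the camera
    center C and its ground projection P (all 3-D points; see Fig. 5)."""
    gc = np.asarray(C, dtype=float) - np.asarray(G, dtype=float)
    gp = np.asarray(P, dtype=float) - np.asarray(G, dtype=float)
    cos_t = np.dot(gc, gp) / (np.linalg.norm(gc) * np.linalg.norm(gp))
    return np.degrees(np.arccos(np.clip(cos_t, -1.0, 1.0)))

def truncated_gaussian(delta, sigma, bound):
    """Zero-mean Gaussian that is nonzero only inside +/- bound (Fig. 6)."""
    if abs(delta) > bound:
        return 0.0
    return float(np.exp(-0.5 * (delta / sigma) ** 2))

def viewpoint_likelihood(t, theta, p, p_prev):
    """beta = f(dt) * g(dp), Eq. (3)."""
    dp = ((p - p_prev + 180.0) % 360.0) - 180.0    # wrap the pan difference
    f = truncated_gaussian(t - theta, sigma=10.0, bound=5.0)
    g = truncated_gaussian(dp, sigma=20.0, bound=30.0)
    return f * g
```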

D. 3-D Size Calculation

In the previous two sections, we identified some object candidates inside moving blobs. Object candidates are those whose 2-D contours match the object's contour, yet we do not know whether their 3-D sizes match the object's 3-D size. In this section, we calculate the object candidates' 3-D sizes, and based on these, an object-specific classifier performs the final object verification. An object's 3-D size is its width, height, and length, denoted as W, H, and L, respectively.

Consider a pedestrian template superimposed on an image, as shown in Fig. 7. We know the four vertices of the 3-D enclosing cube on the image plane, which are p₁, p₂, p₃, and p₄, where pᵢ = {xᵢ, yᵢ} is a pixel location on the image plane. In the following, we calculate the corresponding four vertices in 3-D space, which are P₁, P₂, P₃, and P₄, where Pᵢ = {Xᵢ, Yᵢ, Zᵢ} is a point in 3-D space. We calculate P₁, P₂, and P₃ first. Based on P₁, we then calculate P₄.


Fig. 6. Viewpoint-likelihood functions. (a) Tilting-angle-likelihood function. (b) Panning-angle-likelihood function.

Fig. 7. Calculation of the 3-D size of a pedestrian using a pedestrian template. The four vertices of the 3-D enclosing cube on the 2-D image plane, i.e., p₁, p₂, p₃, and p₄, are projected onto 3-D space, resulting in P₁, P₂, P₃, and P₄. The 3-D width, height, and length of this pedestrian are computed using W = P₁P₃, H = P₁P₄, and L = P₁P₂, respectively.

1) Calculating P₁, P₂, and P₃: Since the objects of interest are typically on the ground, we assume that Z₁ = Z₂ = Z₃ = 0. Let us focus on P₁ for the moment. With Z₁ = 0, we rewrite the camera equation λp = K[R₃ₓ₃ | t₃ₓ₁]P as

λ [x, y, 1]ᵀ = K [R⁽¹⁾ R⁽²⁾ | t] [X, Y, 1]ᵀ    (4)

or, equivalently,

K [R⁽¹⁾ R⁽²⁾ | t] [X/λ, Y/λ, 1/λ]ᵀ = [x, y, 1]ᵀ    (5)

where R⁽ⁱ⁾ is the ith column of the rotation matrix R. The three unknown variables X/λ, Y/λ, and 1/λ in (5) can be solved, and thus X and Y can be obtained by multiplying the first and the second variables by λ. Similarly, we can also solve P₂ = {X₂, Y₂, 0} and P₃ = {X₃, Y₃, 0}.

2) Calculating P₄: Once we solve P₁ = {X₁, Y₁, 0}, we can also solve P₄ = {X₄, Y₄, Z₄} with the assumption that X₁ = X₄ and Y₁ = Y₄. We first rewrite the camera equation λp = K[R | t]P as

λ [x, y, 1]ᵀ = K[R | t] [X, Y, Z, 1]ᵀ = [[c₁₁ c₁₂ c₁₃ c₁₄], [c₂₁ c₂₂ c₂₃ c₂₄], [c₃₁ c₃₂ c₃₃ c₃₄]] [X, Y, Z, 1]ᵀ    (6)

or, equivalently,

λ [x, y, 1]ᵀ = [c₁₁, c₂₁, c₃₁]ᵀ X + [c₁₂, c₂₂, c₃₂]ᵀ Y + [c₁₃, c₂₃, c₃₃]ᵀ Z + [c₁₄, c₂₄, c₃₄]ᵀ.    (7)

Using the fact that λ = c₃₁X + c₃₂Y + c₃₃Z + 1 and simplifying (7), we have

[c₁₃ − c₃₃x, c₂₃ − c₃₃y, 0]ᵀ Z = [(c₃₁x − c₁₁)X + (c₃₂x − c₁₂)Y + (c₃₄x − c₁₄), (c₃₁y − c₂₁)X + (c₃₂y − c₂₂)Y + (c₃₄y − c₂₄), 0]ᵀ.    (8)

Note that the third equation in (8) can be eliminated, and we are left with two equations in the single unknown Z. Z can be estimated using the least-mean-square method.⁴

With P₁, P₂, P₃, and P₄ solved, the object candidate's 3-D size, defined as x = {W, L, H}, can be computed: W = dist(P₁, P₃), L = dist(P₁, P₂), and H = dist(P₁, P₄), where dist(Pᵢ, Pⱼ) is the distance between Pᵢ and Pⱼ in 3-D space. The resulting 3-D size of the object candidate will be verified using the object-specific classifier.

Theoretically, the 3-D sizes of objects should be consistent in different scenes, or in the same scene but at different locations with different viewing angles. However, in practice, there are two major sources of errors that make the 3-D sizes noisy. One is the object localization error in a "noisy" background, and the other is the error in the coordinate transformation from the 2-D image plane to 3-D space. Both errors were assumed to be Gaussian, and we constructed a generative model for the 3-D size of each object class. Five hundred pedestrian and 300 vehicle locations were manually marked in several video sequences with different camera viewing angles. The resulting 3-D sizes for each object class were fit to a normal distribution: N(μ_p, Σ_p) for pedestrians and N(μ_v, Σ_v) for vehicles, where

μ_p = [56.8, 55.5, 163.9]ᵀ

Σ_p = [[1208.3, −107.9, −1001.5], [−107.9, 362.4, 58.1], [−1001.5, 58.1, 1188.1]]

μ_v = [168.4, 359.9, 134.1]ᵀ

Σ_v = [[740.3, 762.8, 324.5], [762.8, 6012.0, 197.9], [324.5, 197.9, 294.3]].

⁴Kanhere et al. [7] also derived how to solve X₁, Y₁, and Z₄ for the pair {P₁, P₄} in a similar scenario, and our derivation yields essentially the same result as theirs.
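The back-projection described above can be written compactly once the 3 × 4 camera matrix M = K[R | t] is available. The sketch below solves (5) for the ground vertices and (8) for the height of P₄ by least squares; the variable names are assumptions, and the template vertices p₁, p₂, p₃ are assumed to project onto the ground plane with p₄ vertically above p₁, as in Fig. 7.

```python
import numpy as np

def backproject_to_ground(M, px):
    """Back-project an image point onto the ground plane Z = 0 using the
    3x4 camera matrix M = K[R|t] (Eqs. (4)-(5)). Returns (X, Y, 0)."""
    H = M[:, [0, 1, 3]]                       # drop the Z column since Z = 0
    v = np.linalg.solve(H, np.array([px[0], px[1], 1.0]))
    return np.array([v[0] / v[2], v[1] / v[2], 0.0])

def solve_height(M, px_top, P1):
    """Solve Z4 from Eq. (8) by least squares, assuming X4 = X1 and Y4 = Y1."""
    x, y = float(px_top[0]), float(px_top[1])
    X, Y = P1[0], P1[1]
    a = np.array([[M[0, 2] - M[2, 2] * x],
                  [M[1, 2] - M[2, 2] * y]])
    b = np.array([(M[2, 0] * x - M[0, 0]) * X + (M[2, 1] * x - M[0, 1]) * Y + (M[2, 3] * x - M[0, 3]),
                  (M[2, 0] * y - M[1, 0]) * X + (M[2, 1] * y - M[1, 1]) * Y + (M[2, 3] * y - M[1, 3])])
    z, *_ = np.linalg.lstsq(a, b, rcond=None)
    return float(z[0])

def object_size(M, p1, p2, p3, p4):
    """Return (W, L, H) = (|P1P3|, |P1P2|, |P1P4|) as in Fig. 7."""
    P1, P2, P3 = (backproject_to_ground(M, p) for p in (p1, p2, p3))
    P4 = np.array([P1[0], P1[1], solve_height(M, p4, P1)])
    d = lambda A, B: float(np.linalg.norm(A - B))
    return d(P1, P3), d(P1, P2), d(P1, P4)
```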


We define the 3-D size-matching score as

γ ≡ exp( −(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ) )    (9)

where x = {W, L, H} is the 3-D size and (μ, Σ) = (μ_p, Σ_p) or (μ_v, Σ_v).
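For reference, Eq. (9) amounts to a Gaussian score built on the squared Mahalanobis distance. The sketch below evaluates it with the pedestrian statistics reported above (units follow the paper; centimeters is an assumption).

```python
import numpy as np

# Pedestrian 3-D size model (mu_p, Sigma_p) from Section III-D, x = [W, L, H].
MU_P = np.array([56.8, 55.5, 163.9])
SIGMA_P = np.array([[1208.3, -107.9, -1001.5],
                    [-107.9,   362.4,    58.1],
                    [-1001.5,   58.1,  1188.1]])

def size_matching_score(x, mu=MU_P, sigma=SIGMA_P):
    """gamma = exp(-0.5 (x - mu)^T Sigma^{-1} (x - mu)), Eq. (9)."""
    d = np.asarray(x, dtype=float) - mu
    return float(np.exp(-0.5 * d @ np.linalg.solve(sigma, d)))

# Example: a candidate of size [55, 50, 170] scores near 1; implausible sizes score near 0.
```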

Algorithm 1 summarizes the proposed algorithm, where Õ = {Ω^{(i)}_{p,t}, −30° ≤ (p − p*) ≤ 30°, −5° ≤ (t − θ) ≤ 5°, i = 1 ∼ K} is the subset of all templates (see Section III-C), and the thresholds T_α, T_αβ, and T_overall were empirically determined.

Algorithm 1 The Proposed Object-Detection Algorithm
Require: An input frame, denoted as I; a set of quasi-3-D templates O; empirically determined thresholds T_α, T_αβ, and T_overall.
1: Detect the moving foregrounds. Assume that we find N moving blobs enclosed by rectangles Rᵢ, i = 1 ∼ N, where Rᵢ = {x, y, w, h}, {x, y} is the upper left position of this rectangle in the frame, and {w, h} is the width and height of this rectangle.
2: for i = 1 ∼ N do
3:   Exhaustively search the subrectangles of different scales that occupy different positions within Rᵢ. Assume that this yields nᵢ subrectangles rⱼ, j = 1 ∼ nᵢ, where rⱼ = {x, y, w, h}.
4:   for j = 1 ∼ nᵢ do
5:     for all Ω_{p,t} ∈ Õ do
6:       Calculate α using (2).
7:       if α > T_α then
8:         Calculate β using (3).
9:         if α · β > T_αβ then
10:          Calculate γ using (9).
11:          Calculate the overall matching score, denoted as score, using (11).
12:          if score > T_overall then
13:            We infer that the region enclosed by rⱼ contains an object.
14:          end if
15:        end if
16:      end if
17:    end for
18:  end for
19: end for

E. Fusion of 2-D and 3-D Matching Results

To fuse the information given by 2-D contour matching and 3-D size matching, we applied the MEM [33]. The MEM introduced by Jacobs et al. [33] specified that a prediction is made up of a series of predictions from separate models, or experts, each of which is weighted by a quantity determined by a so-called gating function. A general MEM has the following form:

y = Σ_{p=1}^{K} g_p y_p    (10)

where y is the overall MEM prediction; K is the number of experts; g_p is the gating function corresponding to the pth expert, for p = 1 ∼ K; and y_p is the prediction of the pth expert. The logic behind the MEM is that choosing the best expert in practice is difficult, and combining a number of different model structures with proper weights can potentially provide a better prediction than a single expert. We treat the 2-D contour-matching score (i.e., β · α) and the 3-D size-matching score (i.e., γ) as predictions given by two experts. Following the MEM, our overall matching score can be defined as

y = g₁ · β · α + g₂ · γ    (11)

where g₁ and g₂ are the weights for the 2-D contour-matching and 3-D size-matching scores, respectively. There are several ways to determine the optimal g₁ and g₂. For example, one can perform a grid search to find the best g₁ and g₂ with respect to the overall verification rates, or evaluate the verification rates solely using 2-D contour matching and 3-D size matching and assign a larger weight to the one with better accuracy. However, in our experiments, we did not obtain significantly better performance by tuning the weights. This may be because we did not have enough training data to find good weights. Since we found that the 2-D contour-matching and 3-D size-matching algorithms have similar performance, in our experiments, we empirically set g₁ = g₂ = 0.5. It may be possible to further boost the performance of the proposed algorithm by estimating better g₁ and g₂ given a larger set of training data.
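The per-candidate decision of Algorithm 1 then reduces to a short cascade. The sketch below combines the three scores as in Eq. (11); the threshold values are placeholders, since the paper only states that they were determined empirically.

```python
def overall_score(alpha, beta, gamma, g1=0.5, g2=0.5):
    """Mixture-of-experts fusion, Eq. (11): y = g1 * beta * alpha + g2 * gamma."""
    return g1 * beta * alpha + g2 * gamma

def verify_candidate(alpha, beta, gamma,
                     t_alpha=0.6, t_alpha_beta=0.5, t_overall=0.55):
    """Threshold cascade of Algorithm 1 (placeholder thresholds)."""
    if alpha <= t_alpha:                 # 2-D contour match too weak
        return False
    if alpha * beta <= t_alpha_beta:     # contour match weighted by the viewpoint likelihood
        return False
    return overall_score(alpha, beta, gamma) > t_overall
```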

IV. EXPERIMENTAL RESULTS

A. Data Sets

We evaluated the proposed algorithm on two publicly available data sets: PETS 2009 [35] (denoted as PETS09) and a traffic sequence from KOGS/IAKS, Universität Karlsruhe, Germany [36] (denoted as dtneuWinter), as well as a video sequence that we recorded on the campus of National Chiao Tung University, Taiwan (denoted as NCTU). PETS09 includes sequences of pedestrians viewed under eight different viewing angles. We used the most tested sequence in PETS 2009: Data set S2: People Tracking, scenario S2.L1, walking. We tested our algorithm under views 1 and 7. Each sequence has 796 images of size 768 by 576. View 1 is recorded under a larger tilting angle, and view 7 is relatively hard due to the large amount of occlusion among pedestrians. dtneuWinter is a traffic sequence showing the intersection Karl-Wilhelm-Berthold-Straße, Karlsruhe, recorded by a stationary camera. This sequence has 301 images of size 768 by 576. NCTU includes both pedestrians and vehicles. It has 2006 images of size 320 by 240. The results on PETS09 give a direct comparison between the proposed algorithm and the state-of-the-art algorithms, whereas the results on dtneuWinter and NCTU demonstrate the effectiveness of the proposed algorithm for vehicle detection.


TABLE I. Measures evaluated in our experiments, where G_t^i and D_t^i denote the ith ground-truth and detected object in frame t, respectively; N_frames is the number of frames in the sequence; N_mapped^t refers to the number of mapped ground-truth and detected object pairs in frame t; N_G^t and N_D^t denote the number of ground-truth objects and the number of detected objects in frame t, respectively; m_t represents the missed detection count; and fp_t is the false positive count.

TABLE II. Pedestrian and vehicle detection results. "Best in [38]" is the best result achieved in PETS 2009; this algorithm can be found in [39]. "Median in [38]" is the median result among all authors in PETS 2009.

B. Evaluation Methodology

The evaluation was based on the framework by Kasturi et al. [37], which is a well-established protocol for performance evaluation of object detection and tracking in video sequences. These measures are formally used in the Video Analysis and Content Extraction program and the CLassification of Events, Activities, and Relationships (CLEAR) consortium. We report the results of the proposed algorithm using the three object-detection measures defined in [37], i.e., SFDA, MODA, and MODP. These results give a direct comparison between the proposed algorithm and the results reported in PETS 2009 [38]. We summarize these measures in Table I. To evaluate the algorithm using these measures, we manually annotated all pedestrians and vehicles with both width and height greater than ten pixels in all the testing sequences (i.e., we specified the corresponding rectangles and the object class labels).
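As a point of reference for Table I, the sketch below computes a normalized MODA in the usual CLEAR style of [37] from per-frame missed-detection and false-positive counts; the cost weights default to 1, and the exact protocol (including how detections are mapped to ground truth) is the one defined in [37], not reproduced here.

```python
def n_moda(misses, false_positives, n_ground_truth, c_m=1.0, c_f=1.0):
    """Normalized Multiple Object Detection Accuracy over a sequence:
    1 - sum_t (c_m * m_t + c_f * fp_t) / sum_t N_G^t."""
    penalty = sum(c_m * m + c_f * fp for m, fp in zip(misses, false_positives))
    total_gt = max(sum(n_ground_truth), 1)
    return 1.0 - penalty / total_gt

# Example: n_moda([1, 0, 2], [0, 1, 0], [5, 6, 5]) -> 0.75
```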

Fig. 8. Object detection results. The red and green cubes show the detected pedestrian and vehicle, respectively. The first three rows show the pedestrian detection results of the PETS09 view 1, PETS09 view 7, and NCTU, respectively. The fourth and fifth rows show the vehicle detection results of NCTU and dtneuWinter, respectively.

C. Results and Discussion

Table II gives the pedestrian and vehicle detection results with respect to SFDA, MODA, and MODP. Fig. 8 graphically shows some detection results. On PETS09 view 1, the proposed algorithm outperformed the best results reported in [38], which were proposed by Berclaz et al. [39]. The proposed algorithm yielded 10% larger SFDA and 10% larger MODP than [39], showing that it can locate objects quite accurately. Our algorithm also yielded 4% larger MODA than [39], which means that it has a higher hit rate and a lower false positive rate. These results demonstrate that the proposed algorithm outperforms the state-of-the-art algorithms in pedestrian detection.


PETS09 view 7 is much harder than PETS09 view 1, primarily due to the large amount of occlusion among pedestrians. As can be seen in Table II, the median results obtained on PETS09 view 7 were much lower than those obtained on PETS09 view 1 (13%–19% lower). However, the proposed algorithm still obtained good results, which were 26%, 25%, and 20% better than the median results in [38] with respect to SFDA, MODA, and MODP, respectively.⁵ One possible reason that the proposed algorithm performs better than the others could be that it exploits both 2-D and 3-D information, whereas most of the others exploit only 2-D information.

We obtained relatively good pedestrian and vehicle detection results on NCTU. In this sequence, both pedestrians and vehicles were large, and the occlusion was not severe. We regard dtneuWinter as the hardest sequence tested in our experiments. In this sequence, the vehicles were relatively small, and some of their colors were very similar to the roads. In addition, this sequence was recorded while it was snowing. All of the preceding issues make foreground detection a hard task. The proposed algorithm could fail without good foreground contours since, in this case, the 2-D contour matching may fail. Nevertheless, our algorithm obtained reasonable results.

The proposed algorithm achieved 20 frame/s on a desktop with an Intel Core2 Duo E8400 3-GHz central processing unit when objects were not occluded. The frame rate dropped depending on the degree of occlusion present in the scene. For the three test videos in this study, it dropped to about 12 frame/s at most.

V. CONCLUSION AND FUTURE RESEARCH DIRECTIONS

We have proposed an object-detection algorithm based on both objects' 2-D contours and 3-D sizes. To match 2-D contours and to calculate the 3-D sizes of objects under arbitrary viewpoints, quasi-3-D object templates have been proposed. A total of 2660 quasi-3-D pedestrian templates and 1995 quasi-3-D vehicle templates are used in this study, and a speedup scheme is employed to reduce the number of templates that are matched. Note that the proposed algorithm is not restricted to detecting moving objects. It can also detect still objects by applying the exhaustive search scheme to the whole image at the cost of higher computational complexity. The major contribution of this paper is to explore the joint use of 2-D and 3-D features in object detection. It shows that, by considering 2-D contours and 3-D sizes, one can achieve promising object detection rates, even when objects are partially occluded. The preliminary results reported in this paper encourage more follow-up work in this research direction, such as object tracking, segmentation, or activity recognition.

⁵Reference [38] did not report the best result of PETS09 view 7. It reported the median result only.


R EFERENCES [1] W. Li and H. Leung, “Simultaneous registration and fusion of multiple dissimilar sensors for cooperative driving,” IEEE Trans. Intell. Transp. Syst., vol. 5, no. 2, pp. 84–98, Jun. 2004. [2] J. Clanton, D. Bevly, and A. Hodel, “A low-cost solution for an integrated multisensor lane departure warning system,” IEEE Trans. Intell. Transp. Syst., vol. 10, no. 1, pp. 47–59, Mar. 2009. [3] D. Randeniya, S. Sarkar, and M. Gunaratne, “Vision-IMU integration using a slow-frame-rate monocular vision system in an actual roadway setting,” IEEE Trans. Intell. Transp. Syst., vol. 11, no. 2, pp. 256–266, Jun. 2010. [4] S. Gupte, O. Masoud, R. Martin, and N. Papanikolopoulos, “Detection and classification of vehicles,” IEEE Trans. Intell. Transp. Syst., vol. 3, no. 1, pp. 37–47, Mar. 2002. [5] Y.-K. Wang and S.-H. Chen, “A robust vehicle detection approach,” in Proc. IEEE Conf. Adv. Video Signal Based Surveillance, Sep. 2005, pp. 117–122. [6] N. Kanhere, S. Pundlik, and S. Birchfield, “Vehicle segmentation and tracking from a low-angle off-axis camera,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2005, vol. 2, pp. 1152–1157. [7] N. K. Kanhere, S. T. Birchfield, and W. A. Sarasua, “Vehicle segmentation and tracking in the presence of occlusions,” Intell. Transp. Syst. Veh.–Highway Autom., no. 1944, pp. 89–97, 2006. [8] X. Tan, J. Li, and C. Liu, “A video-based real-time vehicle detection method by classified background learning,” World Trans. Eng. Technol. Educ., vol. 6, no. 1, pp. 189–192, 2007. [9] M. Enzweiler and D. M. Gavrila (Dec. 2009). Monocular pedestrian detection: Survey and experiments. IEEE Trans. Pattern Anal. Mach. Intell. [Online]. vol. 31, no. 12, pp. 2179–2195. Available: http://dx. doi.org/10.1109/TPAMI.2008.260 [10] P. Viola, M. Jones, and D. Snow, “Detecting pedestrians using patterns of motion and appearance,” in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2003, vol. 2, pp. 734–741. [11] Y.-T. Chen and C.-S. Chen, “A cascade of feed-forward classifiers for fast pedestrian detection,” in Proc. Asian Conf. Comput. Vis., 2007, pp. 905–914. [12] W. C. Chang and C. W. Cho, “Online boosting for vehicle detection,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 40, no. 3, pp. 892–902, Jun. 2010. [13] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2005, vol. 1, pp. 886–893. [14] J. Bégard, N. Allezard, and P. Sayd, “Real-time humans detection in urban scenes,” in Proc. Brit. Mach. Vis. Conf., 2007, pp. 1–10. [15] D. M. Gavrila and V. Philomin, “Real-time object detection for smart vehicles,” in Proc. IEEE Int. Conf. Comput. Vis., 1999, pp. 87–93. [16] G. Borgefors (Nov. 1988). Hierarchical chamfer matching: A parametric edge matching algorithm. IEEE Trans. Pattern Anal. Mach. Intell. [Online]. vol. 10, no. 6, pp. 849–865. Available: http://dx.doi.org/ 10.1109/34.9107 [17] D. M. Gavrila, “Pedestrian detection from a moving vehicle,” in Proc. Eur. Conf. Comput. Vis., 2000, pp. 37–49. [18] D. M. Gavrila and S. Munder, “Multi-cue pedestrian detection and tracking from a moving vehicle,” Proc. Int. J. Comput. Vis., vol. 73, no. 1, pp. 41–59, Jun. 2007. [19] P. F. Felzenszwalb, “Learning models for object recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2001, pp. I-1056–I-1062. [20] G. Miao, Y. Luo, Q. Tian, and J. Tang, “A filter module used in pedestrian detection system,” Artif. Intell. Appl. Innov., vol. 204, pp. 212–220, 2006. [21] D. Hoiem, A. Efros, and M. 
Hebert, “Putting objects in perspective,” Int. J. Comput. Vis., vol. 80, no. 1, pp. 3–15, Oct. 2008. [22] A. Ess, B. Leibe, K. Schindler, and L. V. Gool, “A mobile vision system for robust multi-person tracking,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2008, pp. 1–8. [23] P.-H. Lee, Y.-L. Lin, T.-H. Chiu, and Y.-P. Hung, “Real-time pedestrian and vehicle detection in video using 3d cues,” in Proc. IEEE Int. Conf. Multimedia Expo, 2009, pp. 614–617. [24] T. Corneliu and S. Nedevschi, “Real-time pedestrian classification exploiting 2d and 3d information,” IET Intell. Transp. Syst., vol. 2, no. 3, pp. 201–210, Sep. 2008. [25] S. Z. Li, L. Zhu, Z. Zhang, A. Blake, H. Zhang, and H. Shum, “Statistical learning of multi-view face detection,” in Proc. Seventh Eur. Conf. Comput. Vis., 2002, pp. 67–81. [26] J. Liebelt, C. Schmid, and K. Schertler, “Viewpoint-independent object class detection using 3d feature maps,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2008, pp. 1–8.


[27] C. Wiedemann, M. Ulrich, and C. Steger, “Recognition and tracking of 3d objects,” in Proc. 30th DAGM Symp. Pattern Recognit., 2008, pp. 132–141. [28] Y. Tsin, Y. Genc, and V. Ramesh, “Explicit 3d modeling for vehicle monitoring in nonoverlapping cameras,” in Proc. IEEE Int. Conf. Adv. Vid. Signal Based Surveillance, Washington, DC, 2009, pp. 110–115. [29] I. Katz and H. Aghajan, “Multiple camera-based chamfer matching for pedestrian detection,” in Proc. 2nd ACM/IEEE ICDSC, 2008, pp. 1–5. [30] M. Bertozzi, A. Broggi, R. Chapuis, F. Chausse, A. Fascioli, and A. Tibaldi, “Shape-based pedestrian detection and localization,” in Proc. IEEE Intell. Transp. Syst. Conf., 2003, pp. 328–333. [31] K. Kim, T. H. Chalidabhongse, D. Harwood, and L. Davis, “Real-time foreground-background segmentation using codebook model,” Real-Time Imag., vol. 11, no. 3, pp. 172–185, Jun. 2005. [32] J. P. Zhou and J. Hoang, “Real-time robust human detection and tracking system,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit.—Workshops, 2005, p. 149. [33] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixture of local experts,” Neural Comput., vol. 3, no. 1, pp. 79–87, Spring 1991. [34] H. Barrow, J. Tenenbaum, R. Bolles, and H. Wolf, “Parametric correspondence and chamfer matching: Two new techniques for image matching,” in Proc. Int. J. Conf. Artif. Intell., 1977, pp. 659–663. [35] [Online]. Available: http://www.cvg.rdg.ac.uk/PETS2009/a.html [36] [Online]. Available: http://i21www.ira.uka.de/ [37] R. Kasturi, D. Goldgof, P. Soundararajan, V. Manohar, J. Garofolo, R. Bowers, M. Boonstra, V. Korzhova, and J. Zhang, “Framework for performance evaluation of face, text, and vehicle detection and tracking in video: Data, metrics, and protocol,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 319–336, Feb. 2009. [38] A. Ellis, A. Shahrokni, and J. Ferryman, “Pets2009 and winter-pets 2009 results: A combined evaluation,” in Proc. IEEE Int. Workshop Perform. Eval. Tracking Surveillance, Dec. 2009, pp. 1–8. [39] J. Berclaz, F. Fleuret, and P. Fua, “Multiple object tracking using flow linear programming,” in Proc. IEEE Int. Workshop Perform. Eval. Tracking Surveillance, 2009, pp. 1–8.

Ping-Han Lee received the B.Sc., M.Sc., and Ph.D. degrees from National Taiwan University, Taipei, Taiwan, in 2000, 2002, and 2010, respectively. He is currently with MediaTek Inc., Hsinchu, Taiwan, as a Senior Engineer. His research interests include face recognition, video surveillance, computer vision, and image processing.

Yen-Liang Lin received the B.S. degree in computer science and information engineering from Chang Gung University, Taoyuan, Taiwan, in 2007 and the M.S. degree in 2009 from the Graduate Institute of Networking and Multimedia, National Taiwan University, Taipei, where he is currently working toward the Ph.D. degree. His research interests include video surveillance, image content analysis, and retrieval.

Shen-Chi Chen was born in Taipei, Taiwan, in 1983. He received the B.S. degree in computer science from National Cheng Chi University, Taipei, in 2007 and the M.S. degree in biomedical engineering from National Chiao Tung University, Hsinchu, Taiwan, in 2009. He is currently working toward the Ph.D. degree in computer science and information engineering with the Department of Computer Science and Information Engineering, National Taiwan University, Taipei. His research interests include computer vision, pattern recognition, surveillance systems, and intelligent transportation systems.

Chia-Hsiang Wu received the B.S. degree in computer science and information engineering from National Cheng Kung University, Tainan, Taiwan, in 2009. He is currently working toward the M.S. degree in computer science and information engineering with the Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan. His research interests include face alignment, face recognition, face tracking, and interactive art.

Cheng-Chih Tsai received the B.S. degree in computer science and information engineering from National University of Tainan, Tainan, Taiwan, in 2009. He is currently working toward the M.S. degree in computer science and information engineering from National Taiwan University, Taipei, Taiwan. His research interests include computer vision, image processing, and human–computer interaction.

Yi-Ping Hung received the B.S. degree in electrical engineering from National Taiwan University, Taipei, Taiwan, in 1982 and the Ph.D. degree from Brown University, Providence, RI, in 1990. He is currently a Professor with the Graduate Institute of Networking and Multimedia and the Department of Computer Science and Information Engineering, National Taiwan University. From 1990 to 2002, he was with the Institute of Information Science, Academia Sinica Taiwan, where he became a tenured Research Fellow in 1997 and is now an Adjunct Research Fellow. From 1996 to 1997, he served as a Deputy Director of the Institute of Information Science. Since 2004, he has served on the Editorial Board of the International Journal of Computer Vision. His current research interests include computer vision, image processing, multimedia systems, and human–computer interaction. Dr. Hung has served as the Director of the Graduate Institute of Networking and Multimedia, National Taiwan University, since 2007. He was the Program Cochair of the 2000 Asian Conference on Computer Vision and the 2000 International Conference on Artificial Reality and Telexistence. He was also the Workshop Cochair for the 2003 IEEE International Conference on Computer Vision. He was the recipient of the Young Researcher Publication Award from Academia Sinica in 1997.
