In Proceedings of the Seventh International Conference on Multimodal Interfaces (ICMI 2005), Trento, Italy, October 2005.

Meeting Room Configuration and Multiple Camera Calibration in Meeting Analysis∗

Yingen Xiong
Virginia Polytechnic Institute and State University
621 McBryde Hall, MC0106, Blacksburg, VA 24061, USA
[email protected]

Francis Quek
Virginia Polytechnic Institute and State University
618 McBryde Hall, MC0106, Blacksburg, VA 24061, USA
[email protected]

ABSTRACT

In video-based cross-modal analysis of planning meetings, the meeting events are recorded by multiple cameras distributed throughout the meeting room. The subjects' hand gestures, hand motion, head orientations, gaze targets, and body poses are very important for meeting event analysis. In order to register everything to the same global coordinate system, build 3D models, and extract 3D data from the video, we need to create a proper meeting room configuration and calibrate all cameras to obtain their intrinsic and extrinsic parameters. However, the calibration of multiple cameras distributed over the entire meeting room area is a challenging task, because it is impossible for all cameras in the meeting room to see a reference object at the same time, and wide field-of-view cameras suffer from radial distortion. In this paper, we propose a simple approach to create a good meeting room configuration and calibrate the multiple cameras in the meeting room. The proposed approach includes several steps. First, we create stereo camera pairs according to the room configuration and the requirements of the targets, the participants of the meeting. Second, we apply Tsai's algorithm to calibrate each stereo camera pair and obtain its parameters in its own local coordinate system. Third, we use Vicon motion capture data to transform all local coordinate systems of the stereo camera pairs into a global coordinate system in the meeting room. We thereby obtain the positions, orientations, and parameters of all cameras in the same global coordinate system, so that we can register everything into this global coordinate system. Next, we perform calibration error analysis for the current camera and meeting room configuration and obtain the error distribution over the entire meeting room area. Finally, we improve the current camera and meeting room configuration according to the error distribution. By repeating these steps, we can obtain a good meeting room configuration and the parameters of all cameras for this configuration.

∗This research has been supported by the Advanced Research and Development Activity (ARDA) VACE II grant 665661: From Video to Information: Cross-Modal Analysis of Planning Meetings.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ICMI'05, October 4-6, 2005, Trento, Italy. Copyright 2005 ACM 1-59593-028-0/05/0010 ...$5.00.

Categories and Subject Descriptors

I.4 [Image Processing and Computer Vision]: Digitization and Image Capture - camera calibration; H.5.m [Information Interfaces and Presentation]: Miscellaneous

General Terms

Algorithms, Measurement, Design, Experimentation

Keywords

Meeting Analysis, Meeting Room Configuration, Camera Calibration, Error Analysis

1. INTRODUCTION

Meetings are gatherings of humans for the purpose of communication. As such, we argue that understanding human multimodal communicative behavior, and how witting or unwitting visual displays relate to such communication, is critical. We address the 'Why', 'What', and 'How' of meeting video analysis. 'Why' are we doing this in the first place: is the audio or speech transcription of the communication not sufficient? 'What' are the units of analysis for meeting video: since humans do not use rote presentation of a set of predefined gestural semaphores, what are the entities that we can access, and what models do we need to access them? 'How' might one process the video to get a handle on the entities of access? To enable this research, we are assembling a planning meeting corpus that is coded at multiple levels. This corpus is 'hypothesis driven' in that the coding is designed to support the multimodal language theories that undergird our research. In our meeting event analysis, we capture synchronized multimedia data (10 pair-wise stereo calibrated cameras, wireless fixed-distance directional microphones, table-mounted microphones, and a system of Vicon infrared 3D motion trackers) of the proceedings in the meeting room. The subjects in the meeting are instrumented non-obtrusively with infrared-reflective markers so that we have ground-truth data for head orientation, shoulder orientation, torso orientation, and hand motion. This helps to bootstrap our coding of the data and to train our vision/video processing systems. We engage in video processing and analysis research on the multichannel video to extract head orientation, hand motion, and posture tracking.

With the same idea of cross-modal analysis of planning meetings, to develop multimodal interfaces one needs to understand the constraints underlying human communicative gesticulation and the kinds of features one may compute based on these underlying human characteristics. The 3D model and 3D data are very important for analyzing the subjects' gestures and interactive/instrumental gaze activity in communication. With 3D hand movement data, one can extract hand gesture features [1, 2, 3] during speech or conversation. These gesture features can be used to study the relationship between gesture and speech. With 3D head movement data, one can extract head gestures and gaze targets [4, 5], which are very important in the study of human multimodal communication. In order to obtain 3D data from video for meeting analysis, we need to create a good meeting room configuration and calibrate all cameras so that everything can be registered to the same global coordinate system. However, the calibration of multiple cameras distributed over the entire meeting room area is a challenging task, because it is impossible for all cameras in the meeting room to see a reference object at the same time, and wide field-of-view cameras suffer from radial distortion.

In previous work, Baker and Aloimonos [6, 7] proposed an approach of having LEDs blink in a specified sequence and capturing a very large set of point correspondences. They used these correspondences to compute camera parameters and the fundamental matrix with a large, nonlinear eigenvalue minimization routine. Their approach can be used in calibration situations such as the corridors of a building floor with clusters of cameras placed at intervals along the wall, cameras arranged on the walls of a room or in a dome configuration, or a network of cameras looking outwards. Barreto and Daniilidis [8] also proposed an approach that uses an LED to calibrate multiple cameras over a wide area. They obtained the correspondences between views by deliberately moving an LED to thousands of unknown positions in front of the cameras. Their approach can compute both the projection matrices and the radial distortion parameters without a single non-linear minimization or outlier treatment. However, one of the main disadvantages of this kind of approach is that it is difficult to localize the LEDs to better than a pixel. If more accurate calibration is required, it is necessary to develop other techniques.

In video-based multimodal meeting analysis, meeting room configuration and camera calibration are very important: all information for the analysis comes from the videos, and data acquisition needs an accurate camera calibration. In this work, we propose an approach for creating a good meeting room configuration and calibrating the multiple cameras distributed in the meeting room. In this approach, we first create stereo camera pairs according to the current meeting room configuration and the requirements of the targets. After recording the calibration box in different positions, we apply Tsai's algorithm [9, 10] to calibrate each camera pair in its own local coordinate system. We obtain the orientation, position, parameters, and fundamental matrix for each camera. Then we use Vicon motion capture data, which record all positions of the calibration box, to transform all local coordinate systems into a global coordinate system in the meeting room. Through this process, we obtain the positions, orientations, and parameters of all cameras in the same global coordinate system. Next we perform error analysis for the camera calibration. Finally, we improve the meeting room configuration according to the calibration error distribution. By repeating the above steps, we can create a good configuration for the meeting room and obtain all parameters of all cameras corresponding to that configuration.

In this paper, we also apply the proposed approach to a series of experiments on cross-modal analysis of planning meetings. In each experiment, there are 8 people in the meeting. The meeting events are recorded by 10 cameras and a Vicon motion capture system with 8 motion trackers. The cameras and motion trackers are installed in fixed positions on the ceiling of the meeting room. According to the requirement that each participant should be seen by at least two camera pairs at the same time, we create 12 stereo camera pairs with these 10 cameras. With the approach proposed in this paper, we obtained the meeting room configuration and the camera calibration. The results are satisfying.

2. APPROACH FOR MEETING ROOM CONFIGURATION

Figure 1 shows the whole process of our approach to meeting room configuration in the cross-modal analysis of planning meetings. In this process, we first create an initial configuration for the meeting room according to the meeting requirements and information about the participants, including their positions and the range of their movements. With these requirements and this information, we can decide the initial positions and orientations of all cameras. Second, we create stereo camera pairs according to the initial configuration of the meeting room. We require that each participant be covered by at least two camera pairs: if for some reason one camera pair fails to capture the full data for a participant, we can use another camera pair. Third, we record movies of the calibration box at different positions with the camera pairs. At the same time, the positions of the calibration box are tracked by the Vicon motion capture system. There are a number of calibration dots and Vicon markers on the calibration box, shown in figure 5; the calibration dots are for the camera calibration and the Vicon markers are for the Vicon motion capture system. Next, with these movies, we calibrate each stereo camera pair using Tsai's algorithm and obtain the intrinsic and extrinsic parameters and fundamental matrix for each camera of the pair in its local coordinate system. Then, we create rotation and translation matrices from the Vicon motion capture data and transform the local coordinate systems of all camera pairs into a global coordinate system in the meeting room, so that we can register everything into the same system. Next, with the calibration results, we perform calibration error analysis and compute the calibration error distribution in the global coordinate system. With this error distribution, we know in which areas the current room and camera configuration yields small errors and in which areas it yields large errors. Finally, we make a decision based on the error distribution. If the calibration accuracy is good enough, we use the current camera calibration and room configuration as our results. If not, we improve the room configuration by adjusting the camera positions, orientations, and other properties and repeat the whole process until we obtain good results. In the whole process of meeting room configuration, the most important step is multiple camera calibration. With the calibration results and error distribution, we can improve the current configuration. With this approach, we can obtain a satisfactory meeting room and camera configuration.

Figure 1: Meeting Room Configuration and Multiple Camera Calibration

3. APPROACH FOR MULTIPLE CAMERA CALIBRATION

3.1 Camera Calibration

Since multiple camera calibration is the most important process in the meeting room configuration, we describe it in detail in this section. Figure 2 shows the whole process of multiple camera calibration in the meeting room area. We create $n_c$ camera pairs from the cameras used to record meeting events in the meeting room, according to the meeting room configuration and the requirements of the participants. To calibrate a camera pair for stereo viewing, we use Roger Y. Tsai's versatile camera calibration algorithm described in [9]. Calibrating two cameras for stereo viewing requires recording a calibration target with points of known coordinates. For each camera pair, we therefore make a movie of a calibration box that carries a number of calibration points and Vicon markers whose 3D positions in the local coordinate system of the calibration box are known. We let both cameras see the calibration box and create a stereo image pair $(I_l^i, I_r^i)$, $i = 1, 2, \dots, n_c$. Tsai's algorithm requires at least 11 calibration points, but the number normally used is between 20 and 60.


Figure 2: Multiple Camera Calibration Approach in Meeting Analysis

With this calibration algorithm, we can obtain the calibration results for each camera in a camera pair. The results include 11 parameters. Among them are five intrinsic parameters, f, Cx, Cy, sx, and kappa1: f is the effective focal length of the pinhole camera; kappa1 is the 1st-order radial lens distortion coefficient; Cx, Cy are the coordinates of the center of radial lens distortion and the piercing point of the camera coordinate frame's Z axis with the camera's sensor plane; and sx is a scale factor that accounts for any uncertainty due to framegrabber horizontal scanline resampling. The six extrinsic parameters are Tx, Ty, Tz, Rx, Ry, Rz: Rx, Ry, Rz are the rotation angles of the transform between the world and camera coordinate frames, and Tx, Ty, Tz are the translational components of that transform. The intrinsic parameters describe how the camera forms an image, while the extrinsic parameters describe the camera's pose (i.e., position and orientation) in the world coordinate frame.
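To make the role of these parameters concrete, the following sketch (our own illustration in Python/NumPy, not code from the paper) groups the 11 parameters into a small structure and projects a world point through Tsai's model. The Euler-angle convention used to build R from Rx, Ry, Rz and the sensor pixel sizes dx, dy are assumptions made for the example; in Tsai's formulation dx and dy are fixed camera constants rather than calibrated parameters.

```python
import numpy as np
from dataclasses import dataclass


@dataclass
class TsaiCamera:
    # Five intrinsic parameters
    f: float        # effective focal length
    Cx: float       # image center x (pixels)
    Cy: float       # image center y (pixels)
    sx: float       # horizontal scale factor
    kappa1: float   # 1st-order radial lens distortion coefficient
    # Six extrinsic parameters (world -> camera)
    Rx: float       # rotation angles (radians)
    Ry: float
    Rz: float
    Tx: float       # translation components (mm)
    Ty: float
    Tz: float
    # Assumed fixed sensor constants (mm/pixel); not among the 11 calibrated parameters
    dx: float = 0.01
    dy: float = 0.01

    def rotation(self) -> np.ndarray:
        """Rotation matrix from the three angles (order R = Rz @ Ry @ Rx assumed)."""
        cx, sx_ = np.cos(self.Rx), np.sin(self.Rx)
        cy, sy_ = np.cos(self.Ry), np.sin(self.Ry)
        cz, sz_ = np.cos(self.Rz), np.sin(self.Rz)
        rx = np.array([[1, 0, 0], [0, cx, -sx_], [0, sx_, cx]])
        ry = np.array([[cy, 0, sy_], [0, 1, 0], [-sy_, 0, cy]])
        rz = np.array([[cz, -sz_, 0], [sz_, cz, 0], [0, 0, 1]])
        return rz @ ry @ rx

    def project(self, Pw: np.ndarray) -> np.ndarray:
        """Project a 3D world point (length-3 array) to pixel coordinates."""
        Xc, Yc, Zc = self.rotation() @ Pw + np.array([self.Tx, self.Ty, self.Tz])
        xu, yu = self.f * Xc / Zc, self.f * Yc / Zc   # undistorted image-plane coords
        xd, yd = xu, yu
        for _ in range(5):                            # invert xu = xd * (1 + kappa1 * r^2)
            r2 = xd * xd + yd * yd
            xd, yd = xu / (1 + self.kappa1 * r2), yu / (1 + self.kappa1 * r2)
        return np.array([self.sx * xd / self.dx + self.Cx, yd / self.dy + self.Cy])
```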

We also need to compute the fundamental matrix for each camera pair; the approach is described below. Consider the case of two perspective images of a rigid scene. The geometry of this configuration is shown in figure 3.

Figure 3: Epipolar Geometry

The 3D point $P$ projects to the point $P_l = (u_l\ v_l\ 1)^T$ in the left image and $P_r = (u_r\ v_r\ 1)^T$ in the right image. The epipolar constraint can be expressed as

$$P_r^T F P_l = 0 \qquad (1)$$

where $F$ is the fundamental matrix,

$$F = \begin{pmatrix} f_{11} & f_{12} & f_{13} \\ f_{21} & f_{22} & f_{23} \\ f_{31} & f_{32} & f_{33} \end{pmatrix} \qquad (2)$$

From equation 1 we obtain

$$u_r u_l f_{11} + u_r v_l f_{12} + u_r f_{13} + v_r u_l f_{21} + v_r v_l f_{22} + v_r f_{23} + u_l f_{31} + v_l f_{32} + f_{33} = 0 \qquad (3)$$

Let $a = (u_r u_l\ \ u_r v_l\ \ u_r\ \ v_r u_l\ \ v_r v_l\ \ v_r\ \ u_l\ \ v_l\ \ 1)^T$ and $f = (f_{11}\ f_{12}\ f_{13}\ f_{21}\ f_{22}\ f_{23}\ f_{31}\ f_{32}\ f_{33})^T$. We obtain

$$a^T f = 0 \qquad (4)$$

Equation 4 has 9 unknown elements, so we need at least 8 corresponding points to solve it. We usually have more than the minimum number of points, but they are perturbed by noise, so we look for a least-squares solution:

$$\min \|A f\|^2 \quad \text{subject to} \quad \|f\| = 1 \qquad (5)$$

where each row of $A$ is built from the coordinates $P_l$ and $P_r$ of a single match. Using this approach, we can calibrate each camera pair and obtain all parameters, including the intrinsic parameters, the extrinsic parameters, and the fundamental matrix, for each camera in the pair. These parameters are expressed in the local coordinate system of the camera pair. After calibrating all camera pairs, we need to transform all of these local coordinate systems into a global coordinate system in the meeting room, so that we can register everything in the same coordinate system, perform calibration error analysis, and create the error distribution for the entire meeting room. The coordinate system transformation using Vicon data is described in the next section.
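A minimal NumPy sketch of this estimate is given below; it is our own illustration, not the authors' code. Each row of A is the vector a of equation 4 built from one match, and f is taken as the right singular vector of A associated with the smallest singular value, which minimizes ||Af|| subject to ||f|| = 1. The final rank-2 projection of F is a common post-processing step that the paper does not mention explicitly.

```python
import numpy as np


def estimate_fundamental_matrix(pl: np.ndarray, pr: np.ndarray) -> np.ndarray:
    """Estimate F from n >= 8 matched points.

    pl, pr: (n, 2) arrays of pixel coordinates (u, v) in the left and right
    images, satisfying P_r^T F P_l = 0 for the homogeneous points.
    """
    ul, vl = pl[:, 0], pl[:, 1]
    ur, vr = pr[:, 0], pr[:, 1]
    ones = np.ones_like(ul)
    # One row per match: a = (ur*ul, ur*vl, ur, vr*ul, vr*vl, vr, ul, vl, 1)
    A = np.stack([ur * ul, ur * vl, ur, vr * ul, vr * vl, vr, ul, vl, ones], axis=1)
    # min ||A f|| subject to ||f|| = 1: right singular vector of smallest singular value
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    # Enforce the rank-2 constraint on F (common post-processing, not stated in the paper)
    U, S, Vt2 = np.linalg.svd(F)
    S[2] = 0.0
    return U @ np.diag(S) @ Vt2
```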

3.2 Coordinate System Transformation

When we calibrate each stereo camera pair, the origin of the 3D coordinate system is on the calibration box. During the calibration process we change the box position for each camera pair, so every camera pair has its own 3D local coordinate system. We need to transform them all into one 3D world coordinate system, which we call the Vicon global coordinate system. During calibration, the stereo camera pair sees the calibration dots and Vicon markers on the calibration box while we record movies of them. The Vicon motion capture system records the coordinates of the Vicon markers in the Vicon global coordinate system. At the same time, we also have the coordinates of the Vicon markers in the 3D local coordinate system of the calibration box. With the coordinates of the Vicon markers in both coordinate systems, we can transform the 3D local coordinate system of the calibration box into the Vicon global coordinate system, so that every stereo camera pair uses the same world coordinate system.

Let $Y$ denote the new coordinate system (the Vicon global coordinate system) and $X$ denote the old coordinate system (the 3D local coordinate system of the calibration box). We can transform $X$ into $Y$ by equation 6,

$$Y = RX + T \qquad (6)$$

where $R$ is the rotation matrix and $T$ is the translation matrix,

$$R = \begin{pmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{pmatrix}, \qquad T = \begin{pmatrix} T_1 \\ T_2 \\ T_3 \end{pmatrix} \qquad (7)$$

Suppose we have $n$ Vicon markers on the calibration box and their positions are

$$X = (X_1\ X_2\ \dots\ X_n)^T \qquad (8)$$

where

$$X_1 = (x_{11}\ x_{12}\ x_{13}),\quad X_2 = (x_{21}\ x_{22}\ x_{23}),\quad \dots,\quad X_n = (x_{n1}\ x_{n2}\ x_{n3}) \qquad (9)$$

Similarly, for $Y$ we have

$$Y = (Y_1\ Y_2\ \dots\ Y_n)^T \qquad (10)$$

where

$$Y_1 = (y_{11}\ y_{12}\ y_{13}),\quad Y_2 = (y_{21}\ y_{22}\ y_{23}),\quad \dots,\quad Y_n = (y_{n1}\ y_{n2}\ y_{n3}) \qquad (11)$$

From the above equations we have

$$\begin{pmatrix} y_{21}-y_{11} & y_{22}-y_{12} & y_{23}-y_{13} \\ y_{31}-y_{11} & y_{32}-y_{12} & y_{33}-y_{13} \\ \vdots & \vdots & \vdots \\ y_{n1}-y_{11} & y_{n2}-y_{12} & y_{n3}-y_{13} \end{pmatrix} = \begin{pmatrix} x_{21}-x_{11} & x_{22}-x_{12} & x_{23}-x_{13} \\ x_{31}-x_{11} & x_{32}-x_{12} & x_{33}-x_{13} \\ \vdots & \vdots & \vdots \\ x_{n1}-x_{11} & x_{n2}-x_{12} & x_{n3}-x_{13} \end{pmatrix} \begin{pmatrix} r_{11} & r_{21} & r_{31} \\ r_{12} & r_{22} & r_{32} \\ r_{13} & r_{23} & r_{33} \end{pmatrix} \qquad (12)$$

Equation 12 can also be expressed as

$$AZ = b \qquad (13)$$

where

$$A = \begin{pmatrix} x_{21}-x_{11} & x_{22}-x_{12} & x_{23}-x_{13} \\ x_{31}-x_{11} & x_{32}-x_{12} & x_{33}-x_{13} \\ \vdots & \vdots & \vdots \\ x_{n1}-x_{11} & x_{n2}-x_{12} & x_{n3}-x_{13} \end{pmatrix} \qquad (14)$$

$$b = \begin{pmatrix} y_{21}-y_{11} & y_{22}-y_{12} & y_{23}-y_{13} \\ y_{31}-y_{11} & y_{32}-y_{12} & y_{33}-y_{13} \\ \vdots & \vdots & \vdots \\ y_{n1}-y_{11} & y_{n2}-y_{12} & y_{n3}-y_{13} \end{pmatrix} \qquad (15)$$

$$Z = \begin{pmatrix} r_{11} & r_{21} & r_{31} \\ r_{12} & r_{22} & r_{32} \\ r_{13} & r_{23} & r_{33} \end{pmatrix} \qquad (16)$$

This is an over-determined problem, which we solve with a least-squares approach, i.e.,

$$e = \sum_{k=1}^{3}\sum_{i=1}^{n-1}\Big(b_{ik} - \sum_{j=1}^{3} a_{ij} Z_{jk}\Big)^2 = \|b - AZ\|_2^2 \qquad (17)$$

By minimizing equation 17 we obtain the rotation matrix $R$, and with equation 6 and $X_1$, $Y_1$ we can also get the translation matrix $T$:

$$T = Y_1 - R X_1 \qquad (18)$$

Figure 5: Calibration Box and Vicon Markers (48 calibration dots and 18 Vicon markers)

For camera pair $i$, $i = 1, 2, \dots, n_c$, we obtain a rotation matrix $R_i$ and a translation matrix $T_i$, which we use to transform its local coordinate system into the Vicon global coordinate system. Once this transformation has been done for all camera pairs, we can register everything in the same world coordinate system, i.e., the Vicon global coordinate system of the meeting room.
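A compact NumPy sketch of equations 12-18 is given below (our own illustration, not the authors' code). X and Y are (n, 3) arrays holding the same Vicon markers in the calibration-box frame and in the Vicon global frame. The optional orthonormalization step, which projects the least-squares estimate onto the nearest true rotation, is a refinement beyond the paper's formulation.

```python
import numpy as np


def box_to_vicon_transform(X: np.ndarray, Y: np.ndarray, orthonormalize: bool = True):
    """Solve Y_i = R @ X_i + T from n >= 4 (non-coplanar) marker positions.

    X: (n, 3) marker coordinates in the calibration-box frame.
    Y: (n, 3) marker coordinates in the Vicon global frame.
    Returns (R, T): a 3x3 rotation matrix and a length-3 translation.
    """
    A = X[1:] - X[0]                              # equation 14: differences w.r.t. marker 1
    b = Y[1:] - Y[0]                              # equation 15
    Z, *_ = np.linalg.lstsq(A, b, rcond=None)     # equation 17: min ||b - A Z||^2
    R = Z.T                                       # equation 16 defines Z = R^T
    if orthonormalize:
        # Optional: project onto the nearest rotation matrix (not in the paper)
        U, _, Vt = np.linalg.svd(R)
        R = U @ Vt
        if np.linalg.det(R) < 0:                  # guard against a reflection
            U[:, -1] *= -1
            R = U @ Vt
    T = Y[0] - R @ X[0]                           # equation 18
    return R, T
```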

4. EXPERIMENTS AND RESULT ANALYSIS

4.1 Experiment Setup

Figure 4 shows the configuration of the meeting room. There are eight people, labeled A through H, in the meeting. Ten movie cameras, labeled C1 through C10, are installed to record the meeting events, and T1 and T2 are two table microphones. Every camera is installed in a fixed position on the ceiling of the meeting room, so that each camera sees certain people at the same time. In this configuration, camera C1 covers D, E, F; C2 covers H, G, F; C3 covers F, E; C4 covers H, A; C5 covers F, G, H; C6 covers B, A, H; C7 covers D, C, B; C8 covers B, A; C9 covers D, E; and C10 covers B, C, D. We also installed a Vicon motion capture system in the meeting room to provide ground truth. Eight Vicon infrared cameras are installed in fixed positions on the ceiling, so that they can track the Vicon markers placed on the targets and provide us with motion data and the positions of the targets.

4.2 Camera Calibration

In order to obtain 3D information, we group these cameras into 12 stereo camera pairs: 1-C9C3, 2-C1C3, 3-C9C1, 4-C4C8, 5-C4C6, 6-C6C8, 7-C7C10, 8-C2C5, 9-C2C4, 10-C3C5, 11-C7C9, and 12-C8C10. We use these camera pairs in our calibration process, so that we can obtain intrinsic and extrinsic parameters for all cameras. Figure 5 shows the calibration box used in the calibration process. We placed 48 dots whose 3D coordinates are known on this calibration box. In the calibration process, we let each camera in a stereo camera pair see these dots and then apply Tsai's approach to compute the intrinsic and extrinsic parameters and the fundamental matrix of the camera. We also installed 18 Vicon markers whose 3D coordinates are known on the calibration box. With the positions of these Vicon markers in the local coordinate system of the calibration box and in the Vicon global coordinate system, we can transform coordinates from one coordinate system into the other. For each stereo camera pair we put the calibration box in one position and take pictures with both cameras, so we obtain a stereo image pair for each camera pair. Figure 6 shows the positions of the calibration box over the whole calibration process. Because the real positions of the dots and Vicon markers on the calibration box are known, we can use them to calibrate the camera pair. When we need to do calibration error analysis and compute the error distribution, we place the calibration box at more positions to cover the entire meeting room. Through the whole calibration process described in section 3.1, we calibrate all camera pairs. For each camera, we obtain 11 parameters, including 5 intrinsic parameters and 6 extrinsic parameters, and the fundamental matrix. In the next step, we perform calibration error analysis and create the error distribution in the meeting room.
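Once a camera pair is calibrated and expressed in the global frame, matched pixels in its two views can be converted into 3D points. The sketch below is a standard linear (DLT) triangulation, given here only as an illustration; it assumes each camera's parameters have been assembled into a 3x4 projection matrix, and it is not necessarily the exact reconstruction procedure used in the paper.

```python
import numpy as np


def triangulate(Pl: np.ndarray, Pr: np.ndarray, xl: np.ndarray, xr: np.ndarray) -> np.ndarray:
    """Linear (DLT) triangulation of one point from a calibrated stereo pair.

    Pl, Pr: (3, 4) projection matrices of the left and right cameras,
            expressed in the same global coordinate system.
    xl, xr: (2,) pixel coordinates (u, v) of the matched point in each image.
    Returns the (3,) world point.
    """
    A = np.vstack([
        xl[0] * Pl[2] - Pl[0],   # u_l * p3 - p1 = 0
        xl[1] * Pl[2] - Pl[1],   # v_l * p3 - p2 = 0
        xr[0] * Pr[2] - Pr[0],
        xr[1] * Pr[2] - Pr[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    Xh = Vt[-1]                  # homogeneous solution of A X = 0
    return Xh[:3] / Xh[3]
```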

4.3 Calibration Results and Analysis

After calibrating the cameras in each stereo camera pair, we can perform calibration error analysis. The results are shown in figures 7 and 8. From figures 7 and 8 we can see that for camera pairs 5 to 12 we obtain very small calibration errors. In the X direction the maximum error is 0.5886 mm, the minimum error is 0.4 mm, and the mean error is 0.4755 mm. In the Y direction the maximum error is 0.6925 mm, the minimum error is 0.3077 mm, and the mean error is 0.4529 mm. In the Z direction the maximum error is 0.5064 mm, the minimum error is 0.3804 mm, and the mean error is 0.4317 mm. Figures 9 to 11 show the mean error distributions over the entire meeting room area in the X, Y, and Z directions. For camera pair 2, consisting of cameras 1 and 3, and camera pair 3, consisting of cameras 9 and 1, the errors are a little larger than those of the previous camera pairs.
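The paper does not spell out the exact error computation; a plausible minimal sketch, assuming the error for each calibration dot is the absolute per-axis difference between its reconstructed position and its known ground-truth position, is shown below. It produces the kind of per-axis minimum, mean, and maximum statistics listed in figure 7.

```python
import numpy as np


def per_axis_error_stats(reconstructed: np.ndarray, ground_truth: np.ndarray) -> dict:
    """Min / mean / max absolute error per axis (statistics of the kind in figure 7).

    reconstructed, ground_truth: (n, 3) arrays of point positions (mm)
    expressed in the same coordinate system (box frame or Vicon global frame).
    """
    err = np.abs(reconstructed - ground_truth)   # (n, 3) per-axis absolute errors
    return {
        "min_mm": err.min(axis=0),
        "mean_mm": err.mean(axis=0),
        "max_mm": err.max(axis=0),
    }
```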

Figure 4: Meeting Room Configuration (cameras C1-C10 with their assigned participants A-H, and the 12 stereo camera pairs)

Figure 6: The Positions of the Calibration Box in Calibration (trials Cal01-Cal12; X and Z axes in mm)

Camera Pair | Error Type | Box X (mm) | Box Y (mm) | Box Z (mm) | Vicon X (mm) | Vicon Y (mm) | Vicon Z (mm)
9,3 | Minimum | 0.8345 | 0.6802 | 0.2783 | 0.6056 | 0.5659 | 0.2805
9,3 | Mean | 25.8170 | 24.0527 | 29.8695 | 19.0394 | 34.6596 | 24.1948
9,3 | Maximum | 143.7288 | 67.7405 | 150.9827 | 89.7858 | 190.2071 | 68.2796
1,3 | Minimum | 0.0542 | 0.0688 | 0.1040 | 0.0656 | 0.0740 | 0.0670
1,3 | Mean | 2.1680 | 1.9321 | 5.3695 | 5.2807 | 2.6116 | 1.9435
1,3 | Maximum | 7.0796 | 5.2384 | 20.6195 | 18.2732 | 11.4451 | 5.2437
9,1 | Minimum | 0.0271 | 0.1264 | 0.1384 | 0.0951 | 0.0290 | 0.1587
9,1 | Mean | 1.7150 | 3.3604 | 4.5724 | 4.3456 | 2.3573 | 3.3578
9,1 | Maximum | 6.9899 | 11.6808 | 13.0940 | 12.1007 | 8.8499 | 11.6918
4,8 | Minimum | 0.9229 | 0.4358 | 0.4398 | 1.9464 | 0.6221 | 0.1138
4,8 | Mean | 20.7877 | 23.1791 | 24.3237 | 31.2432 | 7.2112 | 23.6290
4,8 | Maximum | 72.9576 | 53.6234 | 96.5198 | 117.8984 | 29.8139 | 55.7475
4,6 | Minimum | 0.0118 | 0.0016 | 0.0634 | 0.0148 | 0.0128 | 0.0017
4,6 | Mean | 0.3516 | 0.4411 | 0.7254 | 0.7013 | 0.3987 | 0.4545
4,6 | Maximum | 1.0799 | 1.2594 | 1.7699 | 1.9079 | 1.0734 | 1.3073
6,8 | Minimum | 0.0144 | 0.0111 | 0.0184 | 0.0160 | 0.0195 | 0.0064
6,8 | Mean | 0.4015 | 0.3784 | 0.6428 | 0.4161 | 0.6515 | 0.3843
6,8 | Maximum | 1.3556 | 1.1761 | 1.8772 | 1.4364 | 1.8961 | 1.1866
7,10 | Minimum | 0.0074 | 0.0029 | 0.0171 | 0.0264 | 0.0171 | 0.0025
7,10 | Mean | 0.4455 | 0.5021 | 0.5571 | 0.5618 | 0.4615 | 0.5039
7,10 | Maximum | 1.4874 | 1.4856 | 1.6193 | 1.7391 | 1.5814 | 1.4888
2,5 | Minimum | 0.0021 | 0.0084 | 0.0233 | 0.0238 | 0.0043 | 0.0094
2,5 | Mean | 0.3850 | 0.4356 | 0.3980 | 0.3970 | 0.4245 | 0.4387
2,5 | Maximum | 1.1637 | 1.5460 | 1.1363 | 0.8841 | 1.2134 | 1.5618
2,4 | Minimum | 0.0018 | 0.0027 | 0.0219 | 0.0519 | 0.0014 | 0.0021
2,4 | Mean | 0.3670 | 0.3805 | 0.4261 | 0.5033 | 0.2933 | 0.3849
2,4 | Maximum | 1.1054 | 1.2628 | 1.3027 | 1.3568 | 0.7467 | 1.2703
3,5 | Minimum | 0.0395 | 0.0033 | 0.0007 | 0.0005 | 0.0151 | 0.0089
3,5 | Mean | 0.3515 | 0.3929 | 0.4749 | 0.4371 | 0.3960 | 0.4015
3,5 | Maximum | 0.9650 | 1.1505 | 1.7526 | 1.3997 | 1.4373 | 1.1597
7,9 | Minimum | 0.0001 | 0.0006 | 0.0324 | 0.0121 | 0.0072 | 0.0008
7,9 | Mean | 0.3462 | 0.4638 | 0.4594 | 0.4300 | 0.4373 | 0.4659
7,9 | Maximum | 1.1872 | 1.5687 | 1.5015 | 1.2628 | 1.7652 | 1.5746
8,10 | Minimum | 0.0240 | 0.0265 | 0.0020 | 0.0071 | 0.0002 | 0.0201
8,10 | Mean | 0.3100 | 0.3516 | 0.4060 | 0.3843 | 0.3616 | 0.3515
8,10 | Maximum | 0.9245 | 1.0151 | 1.1746 | 1.1291 | 1.3811 | 1.0203

Figure 7: Calibration Error Analysis Table

Figure 8: Calibration Error Analysis (per-trial errors in the X, Y, and Z directions, in mm)

Figure 11: Mean Error Distribution in Z Direction (error in mm over the X-Z extent of the room)

Figure 12: Image Pair Taken by Camera Pair 1 ((a) left image, (b) right image)

Figure 9: Mean Error Distribution in X Direction (error in mm over the X-Z extent of the room)


For camera pair 2, the mean error is 5.2807 mm in the X direction, 2.6116 mm in the Y direction, and 1.9435 mm in the Z direction. For camera pair 3, the mean error is 4.3456 mm in the X direction, 2.3573 mm in the Y direction, and 3.3578 mm in the Z direction. This is still good enough for our large meeting room, and we can still use these results. For camera pair 1, consisting of cameras 9 and 3, and camera pair 4, consisting of cameras 4 and 8, we obtain large errors. For camera pair 1, the mean error is 19.0394 mm in the X direction, 34.6596 mm in the Y direction, and 24.1948 mm in the Z direction. For camera pair 4, the mean error is 31.2432 mm in the X direction, 7.2112 mm in the Y direction, and 23.629 mm in the Z direction. If we applied these calibration results in our 3D information extraction process, we would obtain large errors. Here we give some reasons why the calibration errors for camera pair 1 and camera pair 4 are large. Figure 12 shows the images taken by camera pair 1 and Figure 13 shows the images taken by camera pair 4. Camera pair 1 includes cameras 9 and 3. As shown in figure 4, camera 9 is supposed to see persons D and E, and camera 3 is supposed to see persons E and F.

Figure 10: Mean Error Distribution in Y Direction (error in mm over the X-Z extent of the room)

Figure 13: Image Pair Taken by Camera Pair 4 ((a) left image, (b) right image)

The baseline between camera 9 and camera 3 is too long, and person E is not far enough away, so each of the two cameras sees only one side of the calibration box well while the other side is viewed at too steep an angle. When we detect the dots on the steep side, it is very difficult to obtain their positions accurately. In addition, the images have some distortion because of the camera installation angles. All of these factors cause the large calibration errors. On the other hand, camera pair 1 is very important because it is supposed to see person E, who is a key person in the meeting. The situation of camera pair 4 is similar to that of camera pair 1. Camera pair 4 includes cameras 4 and 8: camera 4 is supposed to see persons H and A, and camera 8 is supposed to see persons A and B. Camera pair 4 and camera pair 1 are symmetric.

With the computation and analysis above, we obtained the results for the current room and camera configuration. In these results, the calibration errors for camera pair 1 and camera pair 4 are large. If we can tolerate these errors in our data acquisition, we can use the current configuration. Otherwise, we can improve the room and camera configuration by adjusting the camera orientations, positions, and other properties, especially for camera pairs 1 and 4, and repeat the multiple camera calibration until the calibration accuracy is good enough. In this way we can obtain a satisfying meeting room and camera configuration.


5. CONCLUSION

Meeting room configuration and camera calibration are very important for capturing the 3D data used to analyze meeting events in video-based meeting analysis. In order to study the subjects' gestures and gaze activity in the meeting, we need to track head orientation, hand motion, and posture for all participants. Camera calibration gives us the orientations, positions, and parameters of all cameras; with these, we can track the subjects' activity in three dimensions. In our meeting room, we use 10 cameras, distributed over the entire room, to record the meeting events. In order to obtain 3D data from the video, we calibrate all cameras, register everything in the same global coordinate system, and build 3D models for hand and head tracking. In our work, we use two steps to calibrate these widely distributed cameras. First, we create 12 camera pairs from the 10 cameras and apply Tsai's algorithm to calibrate each camera pair in its own local coordinate system. Then we use the Vicon motion capture system to obtain the 3D positions of the calibration box and create the translation and rotation matrices that transform all local coordinate systems of the calibration box into a global coordinate system in the meeting room. With these transformations, we can obtain 3D data when we perform hand, head, and body tracking. Our approach also includes calibration error analysis: we obtain the error distribution over the entire meeting room, which shows us in which areas we will get small errors and in which areas we will get large errors, so that we can improve our meeting room configuration.

6. REFERENCES

[1] Francis Quek and Yingen Xiong, "Oscillatory gestures and discourse," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2003), Hong Kong, 2003.

[2] Yingen Xiong and Francis Quek, "Gestural hand motion oscillation and symmetries for multimodal discourse: Detection and analysis," in 1st IEEE Workshop on Computer Vision and Pattern Recognition for Human Computer Interaction (CVPRHCI), Monona Terrace Convention Center, Madison, Wisconsin, 2003.

[3] Yingen Xiong, Francis Quek, and David McNeill, "Hand motion gestural oscillations and multimodal discourse," in ACM Fifth International Conference on Multimodal Interfaces (ICMI 2003), Vancouver, B.C., Canada, 2003, pp. 132-139.

[4] Paul P. Maglio, Teenie Matlock, Christopher S. Campbell, Shumin Zhai, and Barton A. Smith, "Gaze and speech in attentive user interfaces," in The International Conference on Multimodal Interfaces (ICMI 2000), Beijing, China, 2000.

[5] Elisa N. Lawler and Zenzi M. Griffin, "Gaze anticipates speech in sentence-construction task," in The 15th Annual Conference of the American Psychological Society, GA, USA, 2003.

[6] P. Baker and Y. Aloimonos, "Complete calibration of a multi-camera network," in Proceedings of the IEEE Workshop on Omnidirectional Vision, 2000, pp. 134-141.

[7] P. Baker and Y. Aloimonos, "Calibration of a multicamera network," in Proceedings of the 2003 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'03), 2003, pp. 1-8.

[8] Joao P. Barreto and Kostas Daniilidis, "Wide area multiple camera calibration and estimation of radial distortion," in Proceedings of the Workshop on Omnidirectional Vision and Camera Networks (OMNIVIS 2004), Prague, Czech Republic, 2004, pp. 1-5.

[9] R. Y. Tsai, "An efficient and accurate camera calibration technique for 3D machine vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami Beach, FL, 1986, pp. 364-374.

[10] R. Y. Tsai, "A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses," IEEE Journal of Robotics and Automation, vol. RA-3, no. 4, pp. 323-344, 1987.