IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 7, NO. 5, OCTOBER 2005
A Multicamera Setup for Generating Stereo Panoramic Video
Stavros Tzavidas and Aggelos K. Katsaggelos, Fellow, IEEE
Abstract—Traditional visual communication systems convey only two-dimensional (2-D), fixed field-of-view (FOV) video information. The viewer is presented with a series of flat, nonstereoscopic images, which fail to provide a realistic sense of depth. Furthermore, traditional video is restricted to only a small part of the scene, chosen at the director's discretion, and the user is not allowed to "look around" in an environment. The objective of this work is to address both of these issues and develop new techniques for creating stereo panoramic video sequences. A stereo panoramic video sequence should be able to provide the viewer with stereo vision in any direction (complete 360-degree FOV) at video rates. In this paper, we propose a new technique for creating stereo panoramic video using a multicamera approach, thus creating a high-resolution output. We present a setup that is an extension of a previously known approach developed for the generation of still stereo panoramas, and demonstrate that it is capable of creating high-resolution stereo panoramic video sequences. We further explore the limitations involved in a practical implementation of the setup, namely the limited number of cameras and the nonzero physical size of real cameras. The relevant tradeoffs are identified and studied.

Index Terms—Circular projections, panoramic video, stereo vision, video signal processing.
I. INTRODUCTION AND PREVIOUS WORK

A. Panoramic Imaging Applications
A panoramic image can be defined as an image that contains all the required information so that the whole 360-degree field-of-view (FOV) in any direction is covered. We can divide panoramic imaging applications into the following four groups according to the amount of information they offer to the user.

i) Still Panoramic Applications: In this case, for every direction of view the user is presented with a single still image. Still panoramic images are typically created by taking a set of images of a scene using a rotating camera, and by projecting these images onto a common surface (typically a cylinder [3]–[5] or a sphere [6], [7]).

Manuscript received November 2, 2002; revised June 10, 2004. This work was supported in part by the Motorola Center for Communications at Northwestern University. This paper was presented in part at the 2002 SPIE Conference on VCIP, San Jose, CA, June 2002 and the IEEE International Conference on Image Processing, Rochester, NY, September 2002. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Radu Serban Jasinschi. S. Tzavidas was with Northwestern University, Evanston, IL 60208 USA. He is now with the Performance Analysis Department, Motorola, Inc., Arlington Heights, IL 60004 USA (e-mail: [email protected]). A. K. Katsaggelos is with Northwestern University, Evanston, IL 60208 USA (e-mail: [email protected]). Digital Object Identifier 10.1109/TMM.2005.854430
The merging of multiple images taken with a fish-eye lens has also been proposed [8].

ii) Still Stereo Panoramic Applications: Here the goal is to provide the user with a pair of still images for every direction, one for each eye, in order to create depth perception. The rotating camera paradigm has been extended to include the generation of stereo panoramic images [5]. Two cameras with known relative geometry are rotated, with the axis of rotation passing through the center of projection of one of the cameras. Techniques for generating stereo panoramic views using only one rotating camera have also been proposed [9]. In these techniques, a special type of multiviewpoint projection, called circular projection, is used. The setup we describe in this paper is an extension of this technique. Circular projections and the method to obtain still stereo panoramic pictures with a single rotating camera are described in more detail in Section I-B.

iii) Panoramic Video: The virtual reality system in this case has to be able to provide the user with a video sequence, instead of a single still image, in every direction. One approach is to use a single camera with a special type of mirror attached to it [10], [11]. Special care needs to be taken in the construction of the mirror, which needs to have certain mathematical properties [12]. An alternative solution is to use multiple cameras in fixed relative positions, and create each frame in the sequence by appropriately stitching the resulting images [13].

iv) Stereo Panoramic Video: In this case, the user is able to pick a direction of view in the environment and, for that direction, a pair of video streams, one for each eye, is received. One of the first approaches to include stereo capabilities in a panoramic video application was the use of two panoramic video cameras (cameras with a special mirror attached to them), one above the other, thus providing two viewpoints [14]. However, since the disparity in this case is vertical, it can only be used for depth calculation; human viewers can only use horizontal disparity for depth perception. The only other known attempt to provide stereo panoramic views at video rates is the use of a single camera with a special type of mirror performing circular instead of perspective projections [15]. However, in this case the mirror limits the FOV of the camera to a value less than 360 degrees. Initial efforts to design a special lens that would produce the same effect when attached to a regular camera are also presented in [15].
Fig. 1. (a) Regular perspective projection. World points P₁ and P₂ are projected through rays passing through a single viewpoint V. (b) Multiviewpoint projection. Different world points are projected through different viewpoints. World point Pᵢ is associated with viewpoint Vᵢ for i = 1, 2, 3.
In this paper, we present a new multicamera setup that is capable of creating the desired views (360-degree FOV with horizontal disparity) at video rates, and analyze the trade-offs involved in its construction. The rest of this paper is organized as follows. In Section I-B, we describe in detail the method that was developed in [9] for creating still stereo panoramic images, since our setup is based on the same principles. In Section II, we describe the proposed setup. In Section III, we analyze the setup and mathematically derive the trade-offs involved in a practical implementation. In Section IV, experimental results are presented, and conclusions are drawn in Section V.
Fig. 2. (a) Circular projections: world points P₁ and P₂ are projected onto the image surface by using the rays tangent to the viewing circle. Viewpoints R₁ and R₂ are associated with world points P₁ and P₂, respectively. (b) Circular projections and stereo panorama: every world point P is pictured through two different viewpoints for the left and right panoramic images. Viewpoint R is used for the construction of the right panoramic image and viewpoint L for the construction of the left.
B. Circular Projections for Stereo Panorama

In this section, we describe how a special type of multiviewpoint projections called circular projections can be used to construct still stereo panoramic images using a single rotating camera. This technique was originally developed by Peleg et al. [9]. We describe it here in some detail since the proposed approach for constructing a stereo panoramic video sequence builds on it.

Ordinary cameras create images of their environment by performing perspective projection. With this type of projection, scene points are projected onto the image surface, or "image plane", along projection lines passing through a single point, called the "center of projection" (COP) or the "viewpoint" of the camera [Fig. 1(a)]. The resulting images are called single-viewpoint, or perspective, images. Multiple-viewpoint projections differ from perspective projections in that they use different viewpoints for different viewing directions [16], [17]. With this type of projection, different world points are projected onto the image surface using different viewpoints [Fig. 1(b)].

In Fig. 2(a), we present a specific example of multiviewpoint projections, called circular projections. The image surface in this case is a cylinder, and each world point is projected onto the image surface through a ray tangent to a circle, which is called the "viewing circle". The center of the viewing circle coincides with the center of the cylindrical image surface. The viewpoint associated with every world point is located on the viewing circle and is determined by a ray that passes through the world point and is tangent to the circle. In fact, in Fig. 2(a), the rays tangent to the viewing circle are drawn in the counter-clockwise direction. Clearly one can construct a different image by using the
Fig. 3. (a) Single perspective camera is rotated around an axis behind the camera (the axis of rotation is perpendicular to the figure plane). (b) Perspective camera is replaced by its perspective projection model. C₁ is the viewpoint of the camera. In circular projections, the points lying on the ray R₁C₁ are projected through viewpoint R₁. These points are projected by the regular camera onto a narrow strip on the left of the image plane. We can obtain the result of circular projections of these world points by taking the part of the image that lies in the intersection of the ray R₁C₁ and the image plane of the perspective camera.
same image surface but projecting world points instead through the rays tangent to the viewing circle in the clockwise direction [left ray in Fig. 2(b)]. By using the rays tangent to the viewing circle in the two directions, we can construct two multiviewpoint images. The left panoramic image is constructed using the rays tangent to the viewing circle in the clockwise direction, while the right panorama is constructed using the rays tangent to the viewing circle in the counter-clockwise direction. It has been shown in [9] that the pair of panoramic images constructed in this manner comprises a stereo pair.

We can approximate circular projections by rotating a single regular (perspective) camera around an axis located behind the camera, as shown in Fig. 3(a). In this figure, we show the camera in two successive positions as it rotates around an axis located behind the physical camera. In Fig. 3(b), we have replaced the perspective camera by its perspective projection model, showing its image plane and its viewpoint. As mentioned earlier, the regular camera projects all world points to the image plane utilizing rays passing through its viewpoint. We can see from this figure
Fig. 4. Proposed setup for creating stereo panoramic video.
that for the world points that lie in the direction of the ray tangent to the viewing circle passing through the viewpoint of the regular camera, we can obtain the result of the circular projection by taking a narrow strip from the left part of the image that the regular camera produces. The perspective camera in each successive position will give us one narrow strip representing the result of the circular projections for world points lying in a specific direction (the direction defined by the ray tangent to the viewing circle that passes through the viewpoint of the regular camera). For world points in different directions or at different depths, we will need to place the regular camera at different positions in order to obtain the result of the circular projections. Thus, by rotating the regular camera and collecting small strips from the left part of the images produced by the regular camera, we construct the panoramic image for the right eye. Similarly, by collecting small strips from the right part of the image, we construct the panorama for the left eye, thus producing a pair of stereo panoramic images; a sketch of this construction is given below. The width of the strips from the original images should ideally be very small (in the limit, zero when dealing with continuous images, or one pixel wide when dealing with discrete images). Also observe that each camera position can provide us with only one theoretically correct ray for each panorama (one for the left panorama and one for the right, as defined by the two rays tangent to the viewing circle passing through the viewpoint of the regular camera). We therefore need a large number of samples (pictures from the rotating camera at different positions) in order to obtain the desired result. We will see in Sections II–IV that when attempting to create stereo panoramic video the number of samples is limited, resulting in a number of critical tradeoffs that we study in detail.
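To make the strip-collection procedure concrete, the following minimal sketch (our illustration, not code from the paper; the frame list, strip width, and the column offset of the tangent rays are assumed placeholders) pastes together the left-eye and right-eye panoramas from a sequence of perspective images:

```python
import numpy as np

def build_stereo_panoramas(frames, strip_width=1, offset=None):
    """Assemble a stereo pair of panoramas from perspective images.

    frames: list of H x W x 3 arrays, one per camera position around the
    rig (or per angular step of the rotating camera). As described in the
    text, strips from the LEFT part of each image build the RIGHT-eye
    panorama, and strips from the RIGHT part build the LEFT-eye panorama.
    The offset of the strips from the image center corresponds to the rays
    tangent to the viewing circle; here it is a fixed column count, which
    is a simplifying assumption of this sketch.
    """
    h, w, _ = frames[0].shape
    if offset is None:
        offset = w // 4                       # assumed tangent-ray column
    c = w // 2
    right_strips, left_strips = [], []
    for img in frames:
        right_strips.append(img[:, c - offset : c - offset + strip_width])
        left_strips.append(img[:, c + offset : c + offset + strip_width])
    return np.hstack(left_strips), np.hstack(right_strips)  # (left, right)
```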
II. DESCRIPTION OF PROPOSED STEREO PANORAMIC VIDEO SETUP

Any setup for generating panoramic views of dynamic scenes cannot have any moving parts, since at any time instant we have to be able to cover the whole viewing space simultaneously. Also, at any time instant, the setup has to be able to produce a pair of high-resolution panoramic images, one for each eye. The setup we propose consists of multiple cameras mounted on
a rig with fixed geometry, as shown in Fig. 4. In this setup, all cameras look outwards from the center of the rig. In our approach, we also employ circular projections for the construction of each frame in the stereo panoramic video sequence. If we assume that we have N cameras in the rig, then at any time instant N images of the environment at different directions are available simultaneously. Given the images from all cameras at any given time instance, we construct the left and right panoramic images by following the method that was presented in the previous section, i.e., by taking the appropriate strips from the left and right parts of the images and pasting them together. We therefore obtain the same result that a rotating camera passing through the positions of the N cameras would produce, but we do so with no moving parts, thus enabling the construction of stereo panoramic images at video rates. Assuming that we have a very large number of cameras in the setup, and also ignoring for the moment the physical dimensions of each camera, the proposed setup is an alternative implementation of the rotating camera method but with no moving parts.

We mentioned earlier that a large number of samples is needed in order to approximate the result of the circular projections with the rotating camera approach. The number of samples can be increased in the rotating camera case by decreasing the angular increment of the rotation. In our approach, however, since we cannot allow for any moving parts, we can only have a limited number of samples, which equals the number of cameras N in the setup. We therefore need to investigate how the limited number of samples affects the accuracy of the result. In Sections III and IV, we analyze the setup and its parameters, including the number of cameras N, and their impact on the perceivable quality of the resulting panoramic sequences.

III. ANALYSIS OF PROPOSED STEREO PANORAMIC VIDEO SETUP

In practice, only a finite number of cameras can be included in the rig. In fact, this number has to be as small as possible in order to minimize the cost of such a rig. Also, since each camera has certain physical dimensions, the larger the number of cameras used, the larger the size of the rig needed. In this section we analyze the limitations imposed by the limited number and the nonzero physical dimensions of the cameras in the rig, and discuss the resulting tradeoffs.

A. Size of the Rig and Perspective Correction

Since every real camera has nonzero physical dimensions, as the number of cameras in the setup increases, the circumference, and therefore the radius, of the rig increases in order to accommodate all the cameras. In order to calculate the required radius of the rig for a given number of cameras, we assume that the rig is a regular polygon, as shown in Fig. 5. Each camera occupies one side of this polygon. Let us assume that W is the width of each physical camera. If we include N cameras in the rig, then, according to Fig. 5, we have

    sin(α/2) = (W/2)/R,  α = 360°/N.   (1)
Fig. 5. Relationship between the number of cameras N and the radius of the rig. R is the radius of the rig, W is the physical width of the cameras, and α is the angle between successive cameras in the rig.
Fig. 6. Each image is warped onto a cylinder before taking the strip. The radius of the cylinder is equal to the focal length of the cameras and its center coincides with the center of projection of the camera.
For known W and N, the required radius R of the rig is given by

    R = W / (2 sin(180°/N)).   (2)
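As a quick numerical check of (2) as reconstructed above, the following sketch (ours; function names are arbitrary) computes the rig radius for the N = 32, W = 42 mm configuration used later in Section IV:

```python
import math

def rig_radius(n_cameras: int, cam_width_mm: float) -> float:
    # (2): each of the N cameras of width W occupies one side of a
    # regular N-gon, so W = 2 R sin(180 deg / N).
    return cam_width_mm / (2.0 * math.sin(math.pi / n_cameras))

print(f"{rig_radius(32, 42.0):.1f} mm")  # ~214.2 mm, i.e., about 21.4 cm
```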
From the above formula, it is clear that as the number of cameras in the rig increases (assuming the camera width W fixed), the required radius R of the rig increases as well. Note that a rig is defined by the type (physical dimension W) and number N of cameras used in it. Therefore, each pair of values (N, W) defines a different physical rig.

As already mentioned, in the ideal setup, the width of the strips that we take from each image is very small (ideally zero). In a real implementation, however, the width of the strips is a function of the number of cameras in the rig. In a practical scenario, we only have a limited number of images available, which results in a relatively large width of the strips (tens or hundreds of pixels). As a result, the perspective cameras, which implement the circular projections, introduce visible distortions in the form of discontinuities. Objects appearing in consecutive strips may appear discontinuous because of the different parameters of the perspective projection performed by the real cameras, which have different viewpoints. In order to correct for these discontinuities, we warp each image onto a cylinder before taking the strip, as shown in Fig. 6 (a sketch of this warp follows).
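A minimal sketch of this cylindrical warp, assuming a pinhole camera with focal length f expressed in pixels and using nearest-neighbor sampling for brevity (a real implementation would interpolate; the names are ours):

```python
import numpy as np

def warp_to_cylinder(img: np.ndarray, f: float) -> np.ndarray:
    """Project a perspective image onto a cylinder of radius f whose
    center coincides with the camera's center of projection (cf. Fig. 6).
    Uses inverse mapping: for each cylinder pixel, sample the source
    perspective image."""
    h, w = img.shape[:2]
    yc, xc = (h - 1) / 2.0, (w - 1) / 2.0
    jj, ii = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    theta = (ii - xc) / f                                       # cylinder angle
    x = np.round(f * np.tan(theta) + xc).astype(int)            # source column
    y = np.round((jj - yc) / np.cos(theta) + yc).astype(int)    # source row
    out = np.zeros_like(img)
    valid = (x >= 0) & (x < w) & (y >= 0) & (y < h)
    out[valid] = img[y[valid], x[valid]]
    return out
```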
Fig. 7. C₁ and C₂ are the viewpoints of two successive cameras in the rig. The two cameras are positioned at angle α/2 on either side of the y-axis of the reference coordinate system. Arrows R₁ and R₂ depict the rays tangent to the viewing circle that pass through viewpoints C₁ and C₂, respectively. R is the radius of the rig and r is the radius of the viewing circle. (a) Direction of view tangent to the viewing circle is defined by angle Φ. (b) Strip from each image is obtained by taking the part of the image that corresponds to angle θ on either side of the direction defined by angle Φ. (c) Rightmost boundary of the area of the environment that is visible to the strip from the image from camera 1. (d) Leftmost boundary of the area of the environment that is visible to the strip from the image of camera 2.
B. "Blind" Areas

One of the most important effects of having a limited number of cameras in the setup is that some parts of the environment are not imaged by any of the cameras in the rig, and therefore objects in these areas are not visible in the resulting panorama. We will refer to such parts of the scene as "blind" areas. In this section, we study how the number of cameras N and the physical size W of the cameras affect the size of the blind areas.

In Fig. 7, we show two successive cameras in the setup. The angle α between the two successive cameras is given by (1). Our goal is to determine the part of the environment that is left uncovered as we move from one camera to the next. Referring to Fig. 7(a), the direction of view of the ray tangent to the viewing circle is defined by the angle Φ, given by

    sin Φ = r/R   (3)

where r is the radius of the viewing circle and R is the radius of the rig. As already mentioned, before taking the strips, each image is warped by performing a cylindrical projection (as shown in Fig. 6). Let 2θ be the width of the strip in the warped image [see Fig. 7(b)]. We obtain the strip by taking the part of the image projected on the cylinder that corresponds to angle θ on either side of the direction defined by angle Φ. Note that the projected image is not shown in Fig. 7(b) in order to simplify the figure. Only the image plane of the cameras, onto which the original (un-warped) images are formed, and the angle θ are shown.

In Fig. 7(c), line E₁ represents the leftmost boundary of the part of the environment that is projected onto the strip by camera 1. Indeed, any world points lying to the left of this line will not be projected on the strip by camera 1. OA is the line perpendicular to line E₁. Similarly, as can be seen in Fig. 7(d), line E₂ defines the rightmost boundary of the part of the environment that is projected onto the strip by camera 2. Indeed, world
points that are positioned to the right of line E₂ are projected by the perspective camera outside the strip. OB is the line perpendicular to line E₂. Using Fig. 7(c), we can calculate the slope of line E₁ (angle ψ₁) and the length of the segment OA as follows:
    ψ₁ = 90° − α/2 − Φ + θ   (4)
    |OA| = R |sin(θ − Φ)|.   (5)

Similarly, according to Fig. 7(d), the slope of line E₂ (angle ψ₂) and the length of the segment OB are given by
    ψ₂ = 90° + α/2 − Φ − θ   (6)
    |OB| = R sin(θ + Φ).   (7)

Referring to Fig. 7, we have three possible choices for the angle θ that represents the width of the strips.

i) θ < α/2: In this case, lines E₁ and E₂ intersect at a point in the lower half of the plane (negative y). In other words, as y increases, the two lines move further apart, thus leaving a growing part of the scene uncovered. This choice of θ is clearly not practical.

ii) θ = α/2: In this case, lines E₁ and E₂ become parallel, since, according to (4) and (6), their slopes become equal, that is, ψ₁ = ψ₂ = 90° − Φ. When this happens, as can be seen in Fig. 7, there are no world points that get projected onto both strips. In other words, the two successive strips have no overlapping regions.

iii) θ > α/2: In this case, lines E₁ and E₂ intersect at a point in the upper half of the plane (positive y). Therefore, the two successive strips overlap, and some objects appear in both strips.

The first and most intuitive choice would be to use θ > α/2, since such a choice would leave no part of the scene uncovered (at least beyond a certain minimum distance), and it would also result in overlapping successive strips, which would then have to be stitched and blended together. The main drawback of this option, however, is that the width of the overlapping region between successive strips is not fixed. When two cameras have different viewpoints, the amount of overlap between the images they produce depends on the distance of the objects in the scene from the cameras. In other words, different objects at different depths in a scene will have a different amount of overlap in successive strips. In such cases, we would have to use a registration algorithm similar to those used in earlier stereo panoramic imaging techniques [5], in order to determine point correspondences in the overlapping region. Such algorithms however are computationally costly and slow. In our approach, we are interested in creating panoramic images at video rates and we want to avoid computationally expensive steps. For this reason, we use θ = α/2. As a result, rays E₁ and E₂ become parallel and the area in between these two parallel lines
Fig. 8. FOV of the physical camera limits the value of Φ. 2θ defines the width of the strip taken from the image.
is not visible by any of the two strips; we henceforth call it a "blind strip". We have one such blind strip created for each camera. The width of the invisible area is a measure of the discontinuities between the two successive strips from the two successive cameras: the larger the width of the blind area, the larger the area left uncovered and the larger the discontinuities when strips from successive cameras are pasted together in the final panorama. Our goal is to determine the optimal set of rig parameters that minimizes these discontinuities. Note that, even if one decides to introduce some overlap between successive strips and employ blending algorithms in order to blend the strips together, the optimal choice for the rig parameters would be the same. The combinations of N and W that minimize the width of the uncovered area would also minimize the amount of discontinuities that the registration and blending algorithms would have to correct.

We will calculate the width of the blind strips as a function of the other parameters of the rig (number of cameras N and physical size of cameras W). Notice that when lines E₁ and E₂ become parallel, points A, O, and B lie on the same line (the line perpendicular to E₁ and E₂). Therefore, the distance D between the two parallel lines is given by

    D = 2R sin(α/2) cos Φ = W cos Φ   (8)

where again R is the radius of the rig and Φ is determined through (3) by the radius r of the viewing circle. As can be seen from the above equation, if R and r are constant (therefore Φ is also constant) and W decreases, the width of the blind strips approaches zero in the limit (which is the ideal case of infinitely many, zero-width cameras). However, as mentioned earlier, R increases as the number of cameras increases. Therefore, in a practical scenario, it is impossible to reduce W without increasing the number of cameras, and therefore increasing R.

Another important observation is that the horizontal FOV of the camera limits the possible values of Φ. Referring to Fig. 8, we notice that Φ has to be such that the direction of view that it specifies does not fall outside the image plane of the physical
Fig. 9. Width of "blind strips" for various numbers of cameras N and varying physical size W of the cameras.
camera. More specifically, since it is required that the views corresponding to angles θ on either side of the theoretically correct view are available, the maximum Φ, denoted by Φ_max, is equal to

    Φ_max = FOV/2 − α/2 = FOV/2 − 180°/N   (9)

where FOV is the horizontal FOV of the perspective camera in degrees. If a certain combination of N and W results in a value of R in (2) which, when substituted in (3), results in a value of Φ that is larger than Φ_max, then we conclude that for the given FOV this combination of N and W is invalid. As a last requirement, we stipulate that the radius R of the rig should always be larger than the radius r of the viewing circle, since otherwise the rig cannot be used to simulate circular projections with this particular viewing circle size. Note here that r is a parameter whose value is chosen early in the rig construction process. It determines the disparity between the left and right views in the stereo panoramic video sequence and thus represents the desired disparity in the resulting panorama. Once a particular setup has been built, a different value of r can then be assumed in the construction of the stereo panoramic video as long as the new value does not increase the required FOV of the cameras [described by (10) in the following].

In Fig. 9, the width of the blind strips is plotted, as given by (8), for various values of N and W. In this figure, we have assumed that the FOV of the physical cameras is equal to 55° (which is considered to be the FOV of a "normal" lens [18]). The width W of the cameras is assumed to be in the range of 16 to 42 mm, and the radius r of the viewing circle is set to a fixed value.

There are two important observations we can make about the graph in Fig. 9. The first one is that an increase in the number of cameras does not necessarily reduce the width of the blind strips. This is because an increase in the number of cameras decreases α [(1)], but also increases the required R [(2)], thus reducing Φ [(3)].
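The interaction of (1)–(3) with the validity conditions (9) and R > r is easy to tabulate. The sketch below is our illustration; the 55° FOV and the range of W follow the text, while r = 43 mm is our assumption, chosen because it reproduces the blind-strip widths and minimum camera counts quoted in Section IV. It finds the smallest valid N for a given camera width and the resulting blind-strip width from (8):

```python
import math

FOV_DEG = 55.0    # horizontal FOV of a "normal" lens, as assumed in the text
R_VC_MM = 43.0    # viewing-circle radius r; our assumption (not stated here)

def blind_strip_width(n, w_mm, r_mm=R_VC_MM, fov_deg=FOV_DEG):
    """Blind-strip width D of (8) for an N-camera rig with cameras of
    width W, or None if the combination is invalid for the given FOV."""
    alpha = 2.0 * math.pi / n                     # (1)
    R = w_mm / (2.0 * math.sin(alpha / 2.0))      # (2)
    if R <= r_mm:                                 # rig must enclose viewing circle
        return None
    phi = math.asin(r_mm / R)                     # (3)
    if phi > math.radians(fov_deg) / 2.0 - alpha / 2.0:   # (9)
        return None
    return w_mm * math.cos(phi)                   # (8)

for w in (16.0, 42.0):
    n = 3
    while blind_strip_width(n, w) is None:
        n += 1
    print(f"W = {w:4.1f} mm -> minimum valid N = {n}, "
          f"blind strip = {blind_strip_width(n, w):.1f} mm")
# under these assumptions: N = 43, D ~ 1.5 cm for W = 16 mm,
# and N = 21, D ~ 4.0 cm for W = 42 mm
```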
Fig. 10. Required FOV for various numbers of cameras N and varying physical size W of the cameras.
The second important observation is that the smaller the size of the rig, the more cameras we need in order to get a valid setup, for a given FOV of the cameras. To see this, we solve (9) for FOV and, after substituting Φ from (3), we obtain

    FOV = 2Φ + α = 2 arcsin(r/R) + 360°/N.   (10)

The above relationship is plotted in Fig. 10 for various values of N and W. As can be seen from it, the required FOV increases as N decreases. For a fixed FOV (as in our example FOV = 55°), we cannot use combinations of N and W that require a larger FOV than that. This is illustrated in Fig. 9 by showing zero blind-strip width until a critical number of cameras is reached, so that the resulting Φ is small enough to "fit" into the given FOV of 55°.

In Fig. 9, we observe that the smallest number of cameras N that gives a radius of the rig big enough so that the given FOV is sufficient also represents the best solution in terms of the width of the resulting blind strips. Increasing the number of cameras N decreases the required FOV (which is not that important if a certain fixed FOV is available), but actually increases (although not dramatically) the width of the blind strips. On the other hand, there is a clear advantage, with respect to the width of the blind strips, in reducing the size W of the cameras for a given N. We see in Fig. 9 that for smaller cameras the critical number of cameras is larger, but the resulting blind strip width is smaller. As a final observation note that, for the given range of W and r under consideration, the minimum required number of cameras resulting in a valid setup is N = 21.

C. Disparity Variation

The ideal setup (an infinite number of cameras with small enough size so that adding more cameras does not increase the radius R) does not have a preferred direction of view; it is symmetrical with respect to all directions. This symmetry is actually destroyed due to the limited number of cameras in the actual setup. The main consequence of such lack of symmetry is a change in the disparity of objects in the resulting stereo panorama, as compared to the disparity that the ideal
Fig. 11. Angular disparity in the ideal setup for a point at depth Z. An infinite number of cameras is assumed, thus the rays tangent to the viewing circle are available for all directions. A point P at distance Z from the center of the rig is pictured through the two viewpoints using the rays tangent to the viewing circle. The radius of the viewing circle is r, and the resulting angular disparity is 2β.

Fig. 12. Angular disparity in a real setup for a point at distance OP = Z from the center of the rig. The only rays available in practice are those passing through the viewpoints of the real cameras C₁ and C₂ on the perimeter of the setup. R is the radius of the rig, α is the angle between successive cameras in the setup, and the resulting angular disparity is 2ω.

Fig. 13. Maximum percent disparity variation for different depths Z for various values of N and W. For each pair of values (N, W) the disparity variation is calculated [using (11)–(13)] for a range of depths and the maximum is taken as a representative value and plotted here.

setup would produce. This change in disparity, which we call disparity variation, depends on the distance of an object from the rig and is also different for objects at the same depth but at different directions, as is quantified next. Referring to Fig. 11, the angular disparity [19] of point P is 2β, where

    sin β = r/Z.   (11)

In an actual setup the rays tangent to the viewing circle are not available for all directions and for all depths Z. As mentioned, the regular perspective cameras project a point in the scene through rays that pass through their viewpoints. Therefore, only the rays that pass through the viewpoints of the physical cameras are available in practice. In Fig. 12, we consider a point P at depth Z and at a direction halfway between the viewpoints of two successive cameras C₁ and C₂. Let M be the intersection of the line OP with the line that connects the centers of projection C₁ and C₂. Observe that, due to symmetry, lines OP and C₁C₂ are perpendicular. The angular disparity is now calculated as 2ω, where

    tan ω = R sin(α/2) / (Z − R cos(α/2))   (12)

where R is given by (2), as before. For the given distance Z we can calculate the percentage change in disparity with respect to the ideal disparity as

    100 · |ω − β| / β   (13)

where β and ω are given by (11) and (12), respectively. For each setup (i.e., for each combination of N and W), we calculate the disparity variation, as given by (13), for various values of the depth Z, and then keep the maximum disparity variation as a representative value that describes the performance with respect to disparity of this particular setup. This maximum disparity variation is plotted in Fig. 13 as a function of N and W, with zero disparity variation shown for the combinations which are not valid for the utilized r and FOV values. In Fig. 13, we observe the same type of behavior that we saw when we studied the width of the blind strips (refer to Fig. 9). As before, for each W there is a critical number of cameras that needs to be used in order to produce a setup that is valid for the given FOV. We see that adding more cameras, beyond this critical number, does not improve the performance of the rig, since the maximum disparity variation does not decrease. On the other hand, for a given number of cameras N, there is a clear gain in going to cameras with smaller physical dimensions (smaller W); the maximum disparity variation actually decreases for smaller W, when N is constant. We also observe in Fig. 13 that the maximum disparity variation even in the best case is quite significant (20% in the best case, almost 50% in the worst).
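A numerical sketch of (11)–(13) follows (our illustration; N, W, and the same assumed r as above are placeholders, with the depth range borrowed from the test room of Section IV):

```python
import math

def disparity_variation(n, w_mm, r_mm, z_mm):
    """Percent disparity variation (13) at depth Z, comparing the real
    half-disparity omega of (12) with the ideal beta of (11)."""
    alpha = 2.0 * math.pi / n
    R = w_mm / (2.0 * math.sin(alpha / 2.0))            # (2)
    beta = math.asin(r_mm / z_mm)                       # (11): ideal setup
    omega = math.atan2(R * math.sin(alpha / 2.0),       # (12): real setup
                       z_mm - R * math.cos(alpha / 2.0))
    return 100.0 * abs(omega - beta) / beta             # (13)

# maximum variation over depths between 1.75 m and 4.4 m for one rig
worst = max(disparity_variation(32, 42.0, 43.0, z) for z in range(1750, 4401, 50))
print(f"maximum disparity variation: {worst:.1f}%")
```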
We want to investigate whether such a variation is likely to produce a panorama that would be uncomfortable for a human viewer. It is known [19] that disparity is a cue that is used by the human eye to judge relative depths only. A human observer first makes an estimate of the absolute depth in a scene by using different cues, such as occlusion and relative size of objects, and only then uses disparity to make estimates about the relative depth of objects. Furthermore, when viewing a certain scene, the eyes of the observer fixate at a certain depth and use disparities to judge relative depths only for small depth differences with respect to the fixation point. In other words, the disparities that can be combined by the human brain in order to create depth perception are small, and they correspond to relatively small changes in depth. Very large disparities, which correspond to larger depth differences, cannot be combined, and the objects are seen diplopic (i.e., two images are perceived). We saw earlier that the disparity variation is not the same for different values of the depth Z. However, for a small range of depths we can assume that all corresponding disparities change approximately the same with respect to the ideal. In other words, their relative sizes remain approximately the same. Since disparity is used by humans to make only relative depth estimates, even large amounts of disparity variation like the ones shown in Fig. 13 are not likely to produce noticeable differences for human observers. Note however that if the stereo panorama were to be used to calculate three-dimensional (3-D) information based on disparity, then the disparity variation is likely to produce errors in the final result.

Fig. 14. Angular disparity for different directions, for a point at depth Z. R is the radius of the rig; C₁ and C₂ are the viewpoints of two successive cameras on the rig. The angular disparity is given by angle C₁P₂C₂.

Following similar steps to the ones above, we now analyze the disparity variation for points at the same depth but at different directions. In Fig. 14, we show two points P₁ and P₂ at depth Z from the center of the rig, as well as the viewpoints C₁ and C₂ of two successive cameras. Point P₁ is at a direction of view that falls halfway between two successive cameras, while point P₂ is at the same depth but at a different direction, to the right of point P₁. The angle δ between the two directions is called "offset" in the figure. Ideally, the angular disparity for points P₁ and P₂ should be the same, but as we can see in Fig. 14(a) this is not true in a real setup. As shown in Fig. 14, the two points P₁ and P₂ are projected through lines passing through the viewpoints C₁ and C₂ and not through rays tangent to the viewing circle, as in the ideal case. The angular disparity 2ω_δ for point P₂ is equal to

    2ω_δ = arctan[(Z sin δ + R sin(α/2)) / (Z cos δ − R cos(α/2))] + arctan[(R sin(α/2) − Z sin δ) / (Z cos δ − R cos(α/2))].   (14)

For a given distance Z from the center of the rig, a specific combination of N and W, and a specific "offset" δ (Fig. 14), we calculate the disparity 2ω_δ and the percentage change with respect to the disparity for δ = 0, which we call the disparity variation for different offsets (2ω is given by (12)), as follows:

    100 · |ω_δ − ω| / ω.   (15)
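A companion sketch of (14) and (15) (again our illustration, with placeholder values), sweeping the offset δ between 0 and α/2:

```python
import math

def offset_disparity(n, w_mm, z_mm, delta_rad):
    """Angular disparity 2*omega_delta of (14): the angle subtended at
    P = Z (sin(delta), cos(delta)) by the viewpoints C1 and C2, which
    sit at angles -alpha/2 and +alpha/2 on the rig of radius R."""
    alpha = 2.0 * math.pi / n
    R = w_mm / (2.0 * math.sin(alpha / 2.0))      # (2)
    half_base = R * math.sin(alpha / 2.0)         # half the C1-C2 baseline
    px = z_mm * math.sin(delta_rad)
    dy = z_mm * math.cos(delta_rad) - R * math.cos(alpha / 2.0)
    return math.atan2(px + half_base, dy) - math.atan2(px - half_base, dy)

def offset_variation(n, w_mm, z_mm, delta_rad):
    """Percent change (15) relative to the on-axis disparity (delta = 0)."""
    d0 = offset_disparity(n, w_mm, z_mm, 0.0)
    return 100.0 * abs(offset_disparity(n, w_mm, z_mm, delta_rad) - d0) / d0

alpha = 2.0 * math.pi / 32
worst = max(offset_variation(32, 42.0, 1750.0, k * alpha / 40) for k in range(21))
print(f"maximum variation over offsets: {worst:.2f}%")  # well under 2% for N = 32
```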
For each pair of values N and W and for a specific depth Z, we calculate the disparity variation from (15) for values of the angle "offset" from 0 up to α/2. We choose to keep as a representative value for each pair of N and W the maximum value of the disparity variation among all offsets. We refer to this value as the maximum disparity variation for different offsets for the specific pair of values N and W and for the specified depth Z. In Fig. 15, we present plots of the maximum disparity variation for different offsets as a function of the number of cameras N, for a number of values of the depth Z. Three different curves are shown in each plot, corresponding to three representative values of the physical size of the camera W. A zero value is shown for the combinations of N and W that result in a rig radius R which is smaller than the assumed radius of the viewing circle r.

In the plots of Fig. 15 we see that, as before, smaller values of W tend to produce smaller disparity variations. However, here we do have a clear gain when adding more cameras to the rig. We see that in all cases, an increasing N tends to reduce the percent disparity variation. We also observe that, as N increases, the distance between the different curves becomes smaller, thus making the effect of reducing W less significant, and the maximum disparity variation for any given combination of N and W becomes smaller. We saw in the previous sections that for all values of W under consideration the absolute minimum number of cameras needed is 21. We can see in Fig. 15 that for N larger than 21, the resulting maximum disparity variation is less than 2%. Since the disparities that can be combined by the human brain in order to create depth perception are small, the resulting variation of 2% is also small and will not result in a noticeable variation in perceived depth by human observers. We therefore conclude that the lack of symmetry of an actual setup produces some variation in disparity, especially for small distances from the rig, but this variation is not likely to produce any visible effects in the resulting panoramic image.

IV. EXPERIMENTAL RESULTS

A number of simulations were performed for various combinations of the number of cameras N and the physical width of the cameras W, in order to validate the analytical results derived in this paper. In this section we present some representative examples of the simulations that we performed.
Fig. 15. Maximum disparity variation for different offsets for various combinations of N and W and distances Z from the rig. The disparity variation is calculated as a percentage of the disparity for offset equal to 0. In each graph the distance Z from the rig is fixed. For each pair of values (N, W) the disparity variation is calculated [using (12), (14), and (15)] for a range of offsets (defined in Fig. 14) and the maximum is taken as a representative value and is plotted here.

Fig. 16. Resulting stereo panorama for N = 32 and W = 42 mm. (a) Left panorama. (b) Right panorama.
A model of the setup was developed in 3-D Studio MAX, a 3-D modeling and animation software environment developed by Discreet Corporation (www.discreet.com). Ideal pinhole cameras were used. In the examples shown in this section, NTSC cameras with a focal length of 5.6 mm were used. For various combinations of N and W, we constructed a 3-D model of the resulting setup, placed it in a virtual environment, and constructed stereo panoramic views of that environment. The virtual environment we used consists of a 9.1 m × 6.1 m room that contains objects at various distances from the rig. The minimum distance of the objects in the room from the rig is 1.75 m and the maximum distance is 4.4 m. In our algorithms, we assumed a fixed radius r for the viewing circle. It is this radius that controls the disparity between the left and right views in the resulting panorama.

In Fig. 16, we show one frame in the panoramic video sequence (which is itself a pair of panoramic images), produced by a rig consisting of N = 32 cameras, each one with a physical width of W = 42 mm. The resulting radius of the rig [as
given by (2)] is approximately 21.4 cm. In order to assess the impact of the size of the blind strips on the perceivable quality of the resulting panoramic video sequence, we introduced in the environment a motion pattern that spans multiple blind strips. In particular, the human figure in the right part of the environment in Fig. 16 moves toward the bookcase to its right and back. The stereo panoramic video sequences were constructed for three different setups, and the part of the panorama that contains the motion was selected and rendered into a pair of video streams, one for each eye. In Fig. 17, we present three frames, one from each of the video sequences that the three different setups produce. The frames presented here were taken from the right-eye video sequence at the same time instant for all three setups. The frame presented in Fig. 17(a) corresponds to a setup with N = 21 and W = 42 mm. Note that N = 21 is the minimum number of cameras needed for the given value of W, according to Fig. 9. The width of the blind strips for this combination of N and W is 4.0 cm. It can be seen in this figure that the blind strips introduce significant discontinuities that are quite
Fig. 17. One frame in the right channel of the resulting stereo panoramic sequence for three different setups: (a) N = 21, W = 42 mm, (b) N = 32, W = 42 mm, and (c) N = 43, W = 16 mm.
uncomfortable for a human viewer (observe the discontinuities at the location of the hands of the human figure). In Fig. 17(b), we increase the number of cameras N to 32. As shown earlier, by increasing N we do not reduce the width of the blind strips (in fact, the width of the blind strips is slightly increased to 4.1 cm). Furthermore, in this case, we have a larger number of blind strips, caused by the larger number of cameras. We can see in Fig. 17(b) that now a smaller part of the human figure "fits" in between the blind strips, thus making the discontinuities more noticeable. We therefore see that for a given W, the minimum number of cameras represents the best possible choice, and that increasing the number of cameras in the setup does not improve the quality of the result, as analyzed earlier. In Fig. 17(c) we reduce W to 16 mm. The minimum required N for W = 16 mm is 43, and that is the value used in Fig. 17(c). In this case the number of blind strips actually increases, but their width is reduced to 1.5 cm. We can see that, in this case, even though we have a larger number of discontinuities, their effect is much less noticeable and the resulting panorama is much more comfortable.

We looked at the resulting panoramas using the color-coded glasses method (anaglyphs) in order to determine whether the resulting panoramas for various setup configurations (combinations of N and W) were comfortable in terms of perceived disparity. As already mentioned, any variation in disparity would cause the relative depths of the objects in the scene to appear different than in reality. In particular, the perceived relative depths of the objects in the scene when looking at a fixed direction of view would appear different in the panoramas produced by different setups due to the disparity variation described by (13), thus making a scene appear more flat (smaller relative depths) or less flat (larger relative depths). Furthermore, the effect of the disparity variation described by (15) should be more evident when the viewer changes direction of view in a given panorama. Since the disparity variation varies according to the direction of view, there should be discontinuities in the perceived depth of the objects in the scene. Both kinds of disparity variation, when perceived, should make the panorama uncomfortable for a human viewer. Our conclusion from these observations is that the resulting panoramas are comfortable for the observer. A comprehensive subjective evaluation, however, of the resulting video panoramas is rather elaborate and beyond the scope of this paper. The objective of this paper is to mathematically analyze the important tradeoffs of the proposed setup, based on which the most appropriate setup parameters can be chosen for the particular application.

As a final note, we would like to describe what we call the "big head" effect. For a particular rig (a particular combination of N and W), all the world points that lie inside the rig
are invisible in the resulting panorama and therefore invisible to the user. In fact, the user perceives the points inside the rig as lying behind his eye level (i.e., inside his head). Moreover, the perceived eye level is the level where the actual cameras are in the physical setup. When changing the direction of view, the user's "eyes" rotate on a surface of radius R, the radius of the physical rig. Therefore, a larger rig leaves a bigger part of the scene behind the actual cameras and also increases the radius of the surface onto which the user's "eyes" rotate, thus creating the perception of having a larger head. When changing the direction of view in the resulting panorama, bigger rigs create an unnatural feeling associated with the large part of the scene that seems to always lie behind the viewer's eyes. This effect becomes much less severe for the combinations of N and W which result in smaller rigs, thus offering one more incentive for selecting the minimum possible number of cameras of the smallest size possible.

V. CONCLUSIONS

In this paper, we explored the problem of generating stereo panoramic views of dynamic scenes at video rates. We proposed a setup that consists of several cameras mounted on a rig with fixed geometry. This setup is an extension of a previous approach developed for stereo panoramas of still environments, and is based on the theory of circular projections. We showed that the important parameters defining a setup are the number of cameras included in it and the physical width of each camera. The practical limitations of an actual implementation were identified. Finally, the tradeoffs involved in a setup with a finite number of cameras and a nonzero physical size of the cameras were studied.

The main conclusions that can be drawn from our analysis are summarized as follows. For a given type of camera (given FOV and physical dimensions), there is a minimum number of cameras needed to produce a valid setup. This minimum number is controlled by the FOV of the physical cameras used. The minimum number of cameras is also the optimal choice. Adding more cameras to the setup does not improve its performance in terms of visible area and in terms of the discontinuities introduced in the resulting panorama. An increase in the number of cameras does improve the symmetry of the rig for different directions, but this improvement does not produce a significant difference in the perceived depth in the resulting panorama. It was demonstrated that there is a clear advantage in making the physical size of the cameras smaller. For a given number of cameras, a reduction in the physical size of the cameras results in a smaller rig, and also reduces the size of the areas in the environment that are not visible to the setup. We also saw that when
the size of the cameras becomes smaller, we obtain a smaller disparity variation in the resulting stereo panoramic video.

Our setup was tested using data from ideal (pinhole) cameras placed in a synthetic environment. Pinhole cameras do not produce any distortions except for the distortion resulting from the perspective projection they perform. We saw that we can correct this distortion by cylindrical projection. Pinhole cameras have infinite depth-of-field, i.e., all objects in the scene are in focus, which is not true for real cameras. The effect of out-of-focus objects in the resulting panoramic views has to be investigated. The out-of-focus areas in the images have to be detected and handled appropriately in the panoramic image construction process. Also, in a real scenario, the lighting conditions usually vary slightly from one image to the next, due to errors in synchronizing the parameters of all the cameras in the setup. When two images are pasted together in the final panorama, there might be a noticeable change of luminance and color in many cases. Algorithms that seamlessly blend two consecutive images together in real time need to be investigated. As a final note, observe that a stereo panoramic video sequence consists of 30 stereo panoramic views per second. This represents a large amount of data, and therefore new compression techniques need to be developed for storage and transmission purposes. Extensions of the existing standards in order to handle panoramic and stereo panoramic video sequences are required for this purpose.

REFERENCES

[1] S. Tzavidas and A. K. Katsaggelos, "Multicamera setup for generating stereo panoramic video," in Proc. 2002 SPIE Conf. VCIP, San Jose, CA, Jan. 2002.
[2] S. Tzavidas and A. K. Katsaggelos, "Disparity variation in stereo panoramic video," in Proc. IEEE Int. Conf. Image Processing, Rochester, NY, Sep. 2002.
[3] S. E. Chen, "Quicktime VR: An image based approach to virtual environment navigation," in Proc. ACM SIGGRAPH, 1995.
[4] R. Szeliski, "Video mosaics for virtual environments," IEEE Comput. Graph. Applicat., pp. 22–30, Mar. 1996.
[5] H.-C. Huang and Y. P. Hung, "Panoramic stereo imaging system with automatic disparity warping and seaming," Graph. Models Image Process., vol. 60, no. 3, pp. 196–208, May 1998.
[6] R. Szeliski and H. Y. Shum, "Creating full view panoramic image mosaics and environment maps," in Proc. ACM SIGGRAPH, 1997.
[7] S. Gumustekin and R. W. Hall, "Mosaic image generation on a flattened Gaussian sphere," in Proc. IEEE Workshop on Applications of Computer Vision, 1996, pp. 50–55.
[8] Y. Xiong and K. Turkowski, "Creating image-based VR using a self-calibrating fisheye lens," in Proc. Conf. Computer Vision and Pattern Recognition, San Juan, PR, Jun. 1997, pp. 237–243.
[9] S. Peleg and M. Ben-Ezra, "Stereo panorama with a single camera," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Jun. 1999.
[10] Y. Onoe, K. Yamazawa, H. Takemura, and N. Yokoya, "Telepresence by real-time view-dependent image generation from omnidirectional video streams," Comput. Vis. Image Understand., vol. 71, no. 2, pp. 154–165, Aug. 1998.
[11] S. K. Nayar, "Omnidirectional video camera," in Proc. DARPA Image Understanding Workshop, New Orleans, LA, May 1997.
[12] S. K. Nayar and S. Baker, "Catadioptric image formation," in Proc. DARPA Image Understanding Workshop, New Orleans, LA, May 1997.
[13] V. Nalwa, "A True Omnidirectional Viewer," Bell Laboratories, 1996.
[14] J. Gluckman, S. Nayar, and K. Thoresz, "Real-time omnidirectional and panoramic stereo," in Proc.
DARPA Image Understanding Workshop 1998, Monterey, CA, Nov. 1998, pp. 299–303.
[15] S. Peleg, Y. Pritch, and M. Ben-Ezra, "Cameras for stereo panoramic imaging," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR'00), vol. I, Hilton Head Island, SC, Jun. 2000, pp. 208–214.
[16] P. Rademacher and G. Bishop, "Multiple-center-of-projection images," in Proc. ACM SIGGRAPH, New Orleans, LA, Jul. 1998, pp. 199–206.
[17] D. Wood, A. Finkelstein, J. Hughes, C. Thayer, and D. Salesin, "Multiperspective panoramas for cel animation," in Proc. ACM SIGGRAPH, Los Angeles, CA, Aug. 1997, pp. 243–250.
[18] M. Freeman, The 35 mm Handbook, A Complete Course from Basic Techniques to Professional Applications. Windward, 1980.
[19] T. O. Salmon, Vision Science IV: Binocular Vision. Northeastern State Univ., Tulsa, OK. [Online]. Available: http://arapaho.nsuok.edu/~salmonto/BinocularVision
[20] S. Tzavidas, "A New Setup for Generating Stereo Panoramic Video," M.S. thesis, Dept. Elect. Comput. Eng., Northwestern Univ., Evanston, IL, 2001.
Stavros Tzavidas received the Bachelor’s degree in electrical and computer engineering from the National Technical University of Athens, Athens, Greece, in 1998 and the M.S. degree in electrical and computer engineering from Northwestern University, Evanston, IL, in 2001. In 2001, he joined the Performance Analysis Department, Networks Business, Motorola Inc., Arlington Heights, IL, where he is currently a Senior Research Engineer, working in the area of wireless data networks. He has published papers in many research areas, such as computer animation and virtual reality, stereo panoramic video, and performance of wireless data networks. His current research interests include multimedia and wireless communications, video processing, and stereo panoramic imaging applications.
Aggelos K. Katsaggelos (F’98) received the Diploma degree in electrical and mechanical engineering from the Aristotelian University of Thessaloniki, Thessaloniki, Greece, in 1979 and the M.S. and Ph.D. degrees in electrical engineering from the Georgia Institute of Technology, Atlanta, in 1981 and 1985, respectively. In 1985, he joined the Department of Electrical and Computer Engineering, Northwestern University, Evanston, IL, where he is currently a Professor. He was the holder of the Ameritech Chair of Information Technology (1997–2003). He is also the Director of the Motorola Center for Communications and a member of the Academic Affiliate Staff, Department of Medicine, at Evanston Hospital. He is the editor of Digital Image Restoration (New York: Springer-Verlag, 1991), coauthor of Rate-Distortion Based Video Compression (Norwell, MA: Kluwer 1997), and co-editor of Recovery Techniques for Image and Video Compression and Transmission (Norwell, MA: Kluwer, 1998). He is the co-inventor of ten international patents. Dr. Katsaggelos is a member of the Publication Board of the PROCEEDINGS OF THE IEEE, the IEEE Technical Committees on Visual Signal Processing and Communications, and Multimedia Signal Processing, the Editorial Board of the Academic Press, Marcel Dekker: Signal Processing Series, Applied Signal Processing, and Computer Journal. He has served as Editor-in-Chief of the IEEE Signal Processing Magazine (1997–2002), a member of the Publication Boards of the IEEE Signal Processing Society, the IEEE TAB Magazine Committee, an Associate Editor for the IEEE TRANSACTIONS ON SIGNAL PROCESSING (1990–1992), an Area Editor for the journal Graphical Models and Image Processing (1992–1995), a member of the Steering Committees of the TRANSACTIONS ON IMAGE PROCESSING (1992–1997) and the IEEE TRANSACTIONS ON MEDICAL IMAGING (1990–1999), a member of the IEEE Technical Committee on Image and Multi-Dimensional Signal Processing (1992–1998), and a member of the Board of Governors of the IEEE Signal Processing Society (1999–2001). He is the recipient of the IEEE Third Millennium Medal (2000), the IEEE Signal Processing Society Meritorious Service Award (2001), and an IEEE Signal Processing Society Best Paper Award (2001).