
Digesting Omni-video Along Routes for Navigation

Jiang Yu Zheng, Hongyuan Cai
Indiana University & Purdue University Indianapolis
723 W. Michigan St., Indianapolis, IN 46202, USA
[email protected], [email protected]

ABSTRACT
Omni-directional video records complete visual information along a route. Though replaying an omni-video presents reality, it requires a significant amount of memory and communication bandwidth. This work extracts distinct views from an omni-video to form a visual digest, named a route sheet, for navigation. We sort scenes at the motion and visibility levels and investigate the similarity/redundancy of scenes in the context of a route. We use source data from a 3D elevation map or from omni-videos for the view selection. By condensing the flow in the video, our algorithm can generate distinct omni-view sequences with visual information as rich as the omni-video, for further scene indexing and navigation with GIS data.

General image and video encoding and indexing do not consider the scene continuity along a path. View changes are related to the spatial layout and scene distribution, which makes generic coding a poor fit for omni-video compression and digesting. Works related to route scene description include the generation of panoramic video along streets [12,13] and indoor paths [2] using omni-directional cameras. The data size is huge, since an omni-video is several times as large as a normal video. Aliaga et al. compressed neighboring views at the signal level [12], which removes data redundancy according to the spatial frequency in 2D images. Cai et al. have modeled the view significance based on visibility and scene distance [7]. For navigation, separate views of an object with different shapes may still be considered redundant, since the data are for human perception. Route scene descriptions have been extracted in the sideways directions [8,9,10,11,17], which greatly reduces the data size of street views. In addition, Degener et al. have condensed path views in the forward direction [6] at the object level. In general, by analyzing the similarity of scenes and their semantic meaning, unique scenes are extracted as landmarks for route digests in qualitative navigation [9].

Categories and Subject Descriptors
E.4 [Data]: Coding and Information Theory - data compaction and compression. H.5.1 [Information Interfaces and Presentation]: Multimedia Information Systems - video.

General Terms
Algorithms, Measurement, Design, Experimentation.

Keywords
Level-of-details, omni-video, compression, digest, visibility, motion parallax, navigation, cyber space

To represent scenes effectively and efficiently, our digest of omni-video for virtual navigation along a path is based on the scene variation in the view, which is mainly caused by motion parallax and is further related to the scene depth and camera movement. We examine the view differences and provide distinct views along a path by using a motion equalization algorithm. This digest, much smaller in size than the video, ensures a certain amount of motion between consecutive frames. A more compact route sheet containing critical scenes (e.g., landmarks) at turning locations is then generated from the digest to guide the navigation. Compared to related works that compress panoramas at the lower signal level, our method works at the visibility and motion level, particularly for moving-camera paths.

1. Introduction
Omni-video along a path provides complete visual information about a route. With omni-sensors [1,2,3,4,5,15], complete archives of routes have become possible. A large database has been created for Google StreetView by driving vehicles along every street in a city [13]. More and more omni-videos are being created by local users along trails and in indoor environments. While this has increased the continuity of route scenes, the information in an omni-video is heavily redundant, depending on the camera positions, scene distances, and scene complexity. For navigation, it is necessary and helpful to create a compact digest that maintains the contextual information along a route.

In the following, we first address the criteria for path scene digests in Sec. 2, and then derive, in Sec. 3, the most exposed path that guarantees the minimum scene change along a route. Section 4 describes the view extraction from the omni-video, and experimental results are given in Sec. 5.

This work investigates a new task: omni-video digesting. We sort the visual information in omni-directional videos at a more qualitative level than the video signal itself. A limited number of representative frames are extracted from the omni-video to form a route sheet. At various levels of detail in representing scenes, we focus on the motion and visibility levels in order to generate distinct views. The resulting view sequences can be used for driving and walking, and are efficient for further selecting symbolic landmarks with GIS data.

2. Criteria for Omni-video Digest
Compared to an overhead image, ground-based images capture more vertical surfaces and details in an area. An omni-video can be taken while the camera moves along a certain path, with several types of omni-cameras such as a fisheye camera [14,16], the ladybug camera with a spherical view [4], and a mirror-reflection camera with a cut-off field of view at the top [2,15]. Usually a fisheye lens needs only one camera, while the ladybug camera is composed of several cameras facing outward. In this paper, we use video from an upward-pointing fish-eye camera, which ignores moving objects at lower heights and trivial road areas without many location-specific features, as shown in Figure 1.



Figure 1. Fish-eye omni-view in an urban area. (a) LiDAR elevation data (intensity indicates height; white is the lowest elevation), (b) omni-views from the upward fish-eye camera, and (c) the depth map of an omni-view displayed as intensity (the darker the value, the closer the depth).

Figure 2. View significance, shown as intensity, measured at every position of the free space in an urban area. The view significance is low (dark) around trees (small dots) and buildings, and high (bright) at open places that view many scenes.

Our criteria for digesting a route omni-video are (1) assign a sufficient number of views to maintain the continuity of scenes, so that image-based virtual navigation does not get lost, and (2) assign fewer images to path segments with less change, and no repeated images to invariant scenes. Therefore, the viewpoints assigned along a path do not have equal intervals. From the omni-video, we pick views with less scene redundancy by examining scene overlaps. The change of scenes in a route video largely depends on depth: the farther a scene, the longer it stays in the view. Two steps are taken to ensure less scene redundancy and fewer views while maintaining view continuity. The first is to have large visibility, or long viewing distances, to include as many scenes as possible for geo-referencing; the second is viewpoint selection according to scene changes, or motion parallax.

Figure 3. The most exposed paths from a starting position at the lower-left corner. (a) Field of path vectors to every position in the free space. Directions: blue-north, yellow-east, red-west, green-south; white-buildings. (b) The most exposed paths from the starting position to three destinations.

3. Finding the Most Exposed Path
In the 2D free space of an urban area, a variety of paths can be planned to reach a destination. Among them, the most hidden path [18], hugging the corners of buildings, has omni-views full of changes; building details shift quickly in the video frames due to the close depths. Our first step is therefore to find a path in the free street space, called the most exposed path, that has the minimum total scene change. Our algorithm locates such a path from a starting position to a destination based on the view significance of a viewpoint [7,18]. The visibility measure at a viewpoint P is defined by the 2D area of all the 3D surfaces visible from P, as well as their viewing distances.
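As a rough illustration of how such a visibility measure could be evaluated on a gridded elevation map, the following Python sketch casts rays from a viewpoint and accumulates the newly exposed facade area, weighted by viewing distance. The grid resolution, ray count, camera height, and weighting are our own assumptions for illustration, not the exact formulation of [7,18].

import numpy as np

def view_significance(elev, px, py, eye_h=2.0, n_rays=360, max_r=200):
    # Approximate view significance sigma(P) at a free-space cell (px, py).
    # elev : 2D array of surface heights per grid cell; eye_h : camera height.
    # Returns an area-like score of the surfaces visible from P (illustrative only).
    H, W = elev.shape
    score = 0.0
    for a in np.linspace(0.0, 2.0 * np.pi, n_rays, endpoint=False):
        dx, dy = np.cos(a), np.sin(a)
        max_ang = 0.0                      # highest elevation angle blocked so far
        for r in range(1, max_r):
            x, y = int(px + r * dx), int(py + r * dy)
            if not (0 <= x < W and 0 <= y < H):
                break
            ang = np.arctan2(elev[y, x] - eye_h, float(r))   # angle to the cell top
            if ang > max_ang:
                # newly exposed vertical extent of this facade becomes visible
                visible_h = r * (np.tan(ang) - np.tan(max_ang))
                score += visible_h * r     # farther surfaces span more area per ray
                max_ang = ang
    return score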

Given the elevation map of an urban area from LiDAR data [19], the view significance, denoted σ(P), is measured at all free positions, as shown in Figure 2. Given a starting location and a destination, we employ the shortest-path algorithm [21] to find the path with the maximum accumulated view significance. The most exposed path is calculated by minimizing the objective function

Exp = min_D { ∫_D COST(P) ds }                       (1)

among all possible paths D. The COST at each position is calculated as max(σ) − σ(P), roughly corresponding to the sum of the sky and ground areas in the omni-video frame. Using Dijkstra's algorithm [21], the path to every reachable position is computed from a starting point. Figure 3a displays a vector field of the shortest paths, where four colors show the directions for tracing back to the starting point once a destination is given. Figure 3b shows the paths to three arbitrary destinations from the same starting position.
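The path search itself can be sketched with a standard grid Dijkstra, where each free cell carries COST(P) = max(σ) − σ(P) and the accumulated cost approximates the integral in Eq. (1). This is a minimal sketch assuming a 4-connected grid, a boolean free-space mask, and a precomputed significance field such as the one above; the names are illustrative.

import heapq
import numpy as np

def most_exposed_path(sigma, free, start, goal):
    # Dijkstra over a grid, minimizing the accumulated max(sigma) - sigma(P) (Eq. 1).
    # sigma : 2D view-significance field; free : boolean mask of free space;
    # start, goal : (row, col) cells. Returns the path as a list of cells.
    H, W = sigma.shape
    cost = sigma[free].max() - sigma           # per-cell COST(P)
    dist = np.full((H, W), np.inf)
    prev = {}
    dist[start] = 0.0
    pq = [(0.0, start)]
    while pq:
        d, (r, c) = heapq.heappop(pq)
        if (r, c) == goal:
            break
        if d > dist[r, c]:
            continue                           # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < H and 0 <= nc < W and free[nr, nc]:
                nd = d + cost[nr, nc]          # ds = one grid step
                if nd < dist[nr, nc]:
                    dist[nr, nc] = nd
                    prev[(nr, nc)] = (r, c)
                    heapq.heappush(pq, (nd, (nr, nc)))
    path, node = [goal], goal                  # trace back to the start
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1]

Tracing the predecessor links from every reachable cell, instead of a single goal, yields a vector field like the one shown in Figure 3a.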

In Figure 3b, we can notice that the paths with the largest visibility run along the middle of the streets. We thus move along the computed path to take the omni-video. This is also consistent with the paths the Google vehicle traverses in a city. If a 3D elevation map of the city is available, our vehicle-borne camera moves along a path close to the center of the road. We can then move on to distinct view extraction from the omni-video.

4. Extracting Distinct Views on Path
According to the view significance computed from the area of visible objects, the most exposed path keeps maximum distances from the surrounding scenes so as to include large areas of scenes. This also implies the minimum motion parallax, or scene change, in the omni-video among all paths, because the disparity or motion parallax is inversely proportional to the scene distance.


In a panoramic view, vertical features provide more precise location information than non-vertical lines or curves extending over a large space.


The distribution of vertical features in the panorama provides the orientation of features around the viewpoint, and their moving speeds in the video are related to their distances from the camera. To obtain strong and distinct features, we accumulate the color along the vertical direction of the panorama, or the radial direction of the fish-eye view, to produce one-dimensional data O(φ) as

O(φ) = (1/H) Σ_ϕ I(φ, ϕ)                             (2)

where φ ∈ [0, 2π], ϕ ∈ [-5°, H], and H is the height of the panorama. The projected result is robust to camera shaking and waving up to a certain degree [20], and it ignores insignificant features as well. By aligning O(φ) along the time axis, we obtain a temporal profile

Q(φ, t) = (1/H) Σ_ϕ I(φ, ϕ, t)                       (3)

named the condensed profile of the omni-video, which displays the traces of vertical features; the vertical axis is the orientation φ and the horizontal axis is the time t. Figure 4 shows examples of Q(φ, t).

Figure 4. An omni-video from a fish-eye lens with φ ∈ [0°, 360°], ϕ ∈ [-5°, 90°], and two vertically condensed profiles Q(t, φ) in city blocks. The horizontal axis is the video frame number.
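For concreteness, a minimal sketch of Eqs. (2)-(3) is given below, assuming each omni-frame has already been unwarped into a grayscale panorama of shape (H, 360) indexed by elevation and azimuth; the function names and the 1-degree azimuth binning are our assumptions.

import numpy as np

def condense_frame(panorama):
    # Eq. (2): average the intensity over the vertical (elevation) axis to get
    # a 1-D azimuth profile O(phi) of length 360 for one frame.
    return panorama.mean(axis=0)

def condensed_profile(frames):
    # Eq. (3): stack O(phi) of every frame along the time axis to form the
    # condensed profile Q(phi, t). frames : iterable of (H, 360) panoramas.
    return np.stack([condense_frame(f) for f in frames], axis=1)   # shape (360, T)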



We analyze the traces of movement in the profile, and calculate the partial derivatives of Q(φ, t) with respect to the φ and t directions to obtain the gradient ∇Q(φ, t), which is displayed in Figure 5(a). The tangent of a trace shows the motion parallax of a distinct vertical feature in the omni-video. After smoothing with a small window, we detect the maximum motion parallax at each viewpoint from the gradient as

M(t) = max_φ | (∂Q(φ, t)/∂t) / (∂Q(φ, t)/∂φ) |        (4)

which is mainly contributed by scenes in the side directions of the path (scenes in the forward and backward directions have less motion in the omni-video). Figure 5(b) displays M(t) in degrees along the time axis. Further smoothing with a median filter removes noise caused mainly by a non-smooth camera trajectory due to instantaneous camera/vehicle shaking on the road. The resulting distribution in Fig. 5(b) clearly shows high values on a narrow street with busy scenes, low values in open spaces, and close-to-zero values where the vehicle/camera stops.

Figure 5. Selecting distinct views by equalizing the changes in motion parallax. (a) The movement of location-sensitive features as edge traces from the corresponding condensed profiles in Fig. 4. (b) Maximum motion parallax (in degrees) after smoothing. (c) Frames selected (bars) by applying the motion equalization method.

We require that the distinct views extracted from the video sequence have an equal percentage of scene change, i.e., we ensure a certain amount of scene overlap without over-repeating the same scene in consecutive views. We propose a motion equalization algorithm, similar to histogram equalization, for the view digest. Assume distinct views Id, d = 1, 2, 3, ..., with motion parallax M′(d) converted from M(t); M′(d) should be close to a constant at every index d.

From M(t), we estimate a scaling factor R(t) at each sampled moment t that satisfies R(t) = K × M(t), where K is a constant controlling the amount of parallax between frames. We then scale the length of the video frames I(t) by R(t) along the t axis to generate a new sequence I′(t) with length s(t), t = 1, 2, 3, ...,

s(t) = R(t) + s(t − 1)                                (5)

mapped from the original t. Finally, the distinct views Id are sampled from I′(t) at a fixed interval τ. Figure 5(c) shows the selected frames according to s(t) in the original sequence. We can choose the interval τ for sampling the time sequence I′(t) so as to keep the scene overlaps proper. The real frames of Id are marked in the original sequence I(t) in Figure 5(c).
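The following sketch combines the parallax estimate of Eq. (4) with the motion-equalization sampling of Eq. (5): it takes the condensed profile Q, estimates the largest trace slope per frame, smooths it, accumulates R(t) = K × M(t), and picks the frames where the accumulated change crosses multiples of the sampling interval τ. The constant K, the filter sizes, and the threshold logic are assumptions for illustration, not the paper's exact parameters.

import numpy as np
from scipy.ndimage import median_filter

def max_parallax(Q, eps=1e-6, win=15):
    # Eq. (4): per-frame maximum motion parallax (degrees/frame) from the trace
    # tangents of the condensed profile Q of shape (360, T).
    dQ_dphi = np.gradient(Q, axis=0)
    dQ_dt = np.gradient(Q, axis=1)
    slope = np.abs(dQ_dt) / (np.abs(dQ_dphi) + eps)   # |dphi/dt| along feature traces
    return median_filter(slope.max(axis=0), size=win)  # smoothed M(t)

def equalize_motion(M, K=1.0, tau=30.0):
    # Eq. (5): accumulate s(t) = s(t-1) + R(t) with R(t) = K * M(t), and select the
    # frames where the accumulated parallax crosses multiples of the interval tau.
    s = np.cumsum(K * np.asarray(M))
    selected, next_mark = [0], tau
    for t, st in enumerate(s):
        if st >= next_mark:
            selected.append(t)
            next_mark += tau
    return selected        # indices of the distinct views I_d in the original video

In such a sketch, a larger τ (or a smaller K) would yield fewer, more widely spaced views, which matches the adjustable frame density described in the experiments.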

5. Experiments
Our omni-video is from an upward-pointing fish-eye camera mounted on a vehicle. The video frames are taken at 3-6 m intervals, though we do not need the real distance between viewpoints, since our view evaluation is based on the scene changes in the frames. The maximum M(t) in the given video is about 80 degrees in the φ angle around the camera between omni-video frames. We compute the motion parallax for motion equalization. The distinct views selected in Figure 5 keep the maximum motion parallax of major vertical surfaces at approximately 30 degrees, and the obtained digest is 4% of the original video in number of frames.


Figure 6. A route sheet containing distinct panoramas along a path, generated from the omni-video. The orientations of the views are aligned and labeled in red below. The route directions (E→E+N→N+E→E+N→N+E→E+N+E→E, where → and + denote moving forward and turning at a crossing, respectively) are visualized by a sequence of transparent parts; other directions are translucent.

The number of frames resulting from the omni-video is scene dependent and can be adjusted. Figure 6 shows an even more compact route sheet generated from the digest that is portable for navigation. It contains a sequence of distinct views along the urban path indicated by the red path in Figure 3. The forward moving and turning directions are indicated by the open parts of the views, and the orientations are marked at the bottom. Our experiments show that a 10% overlap between consecutive views is not sufficient to provide route guidance purely based on views, because users cannot follow the large view hopping. A 30% overlap can merely provide scenes for matching between frames; more than 50% overlap is preferred for smooth navigation. Compared to normal video indexing algorithms that pick key frames without any connection between them, our frame selection keeps the view continuity (parallax) to a certain degree to facilitate virtual navigation in cyberspace. We can also adjust the frame density according to the changes along a path, which gives a numerical way to describe the overlaps or scene changes.

6. Conclusion
This work generates a digest of urban omni-video along a route for virtual/real navigation and scene indexing. We maintain the continuity of consecutive views by computing the motion, or changes, in the video along the most exposed path. Given the minimum motion parallax, the resulting distinct views along a path are distributed sparsely where the same distant scenes remain constantly visible, and densely where close and large scenes are passed quickly. The digest is thus more efficient and effective than video in presenting route scenes in terms of data size. A further compact sequence of landmarks can also be generated in a route sheet for guiding navigation.

7. References
[1] Coorg S., Master M., Teller S., Acquisition of a large pose-mosaic dataset, IEEE CVPR, 1998, pp. 23-25.
[2] Yagi Y., Imai K., Tsuji K., Yachida M., Iconic memory-based omnidirectional route panorama navigation, IEEE Trans. PAMI, 27(1), 2005, pp. 78-87.
[3] Teller S., Toward urban model acquisition from geo-located images, Proc. Pacific Graphics 98, 1998, pp. 45-52.
[4] http://www.ptgrey.com/products/legacy.asp
[5] Nayar S. K., Peri V., Folded catadioptric cameras, IEEE CVPR, vol. II, 1999, pp. 217-223.
[6] Degener P., Schnabel R., Schwartz C., Klein R., Effective visualization of short routes, IEEE Visualization 08.
[7] Cai H., Zheng J. Y., Key views for visualizing large spaces, J. of Visual Communication and Image Representation, 20(6), 2009, pp. 420-427.
[8] Zheng J. Y., Tsuji S., Panoramic representation for route recognition by a mobile robot, Int. J. Computer Vision, 9(1), 1992, pp. 55-76.
[9] Zheng J. Y., Tsuji S., Barth M., Qualitative representation of route based on landmarks, ICCV 1991, (3), pp. 2004-2009.
[10] Zheng J. Y., Wang X., Pervasive views: area exploration and guidance using extended image media, ACM Multimedia 05, 2005, pp. 986-995.
[11] Zheng J. Y., Shi M., Scanning depth of route panorama based on stationary blur, Int. J. Computer Vision, 78(2-3), 2008, pp. 169-178.
[12] Aliaga D. G., Funkhouser T., Yanovsky D., Carlbom I., Sea of images, IEEE Visualization, 2002, pp. 331-338.
[13] Google Streetview: http://maps.google.com
[14] Zheng J. Y., Li S., Employing a fish-eye for scene tunnel scanning, ACCV (1), 2006, pp. 509-518.
[15] Yagi Y., Yachida M., Real-time omnidirectional image sensors, Int. J. Computer Vision, 58(3), 2004, pp. 173-207.
[16] Li S., Nakano M., Chiba N., Acquisition of spherical image by fish-eye conversion lens, IEEE VR, 2004, pp. 235-236.
[17] Zheng J. Y., Zhou Y., Mili P., Scanning scene tunnel for city traversing, IEEE Trans. Visualization and Computer Graphics, 12(2), 2006, pp. 155-167.
[18] Cai H., Zheng J. Y., Locating key views for image indexing of spaces, ACM Multimedia Information Retrieval 08, pp. 31-38.
[19] Hu J., You S., Neumann U., Integrating LiDAR, aerial image and ground images for complete urban building modeling, 3DPVT, 2006, pp. 184-191.
[20] Zheng J. Y., Bhupalam Y., Tanaka H., Understanding vehicle motion via spatial integration of intensities, ICPR 08.
[21] Weiss M. A., Data Structures and Algorithm Analysis in C++ (2nd ed.), Addison-Wesley, 1999.

