High Quality Video on MultiMedia PCs
A. Pelagotti and G. de Haan
Philips Research Laboratories, Prof. Holstlaan 4, 5656 AA Eindhoven (NL)

Abstract
Displaying broadcast video on a MultiMedia PC implies the use of video format conversion (VFC) techniques, as computer displays and television receivers use quite distinct scanning rasters. VFC consists of spatial scaling, de-interlacing and picture rate conversion. Although scaling is rather straightforward, the other two tasks are far from trivial, and advanced motion compensated interpolation techniques are necessary to achieve a performance level that can compete with that of a standard TV. This paper discusses the options for picture rate conversion, and shows how even advanced motion compensated algorithms can run in real time on a currently available programmable device.
1. Introduction
Common video cameras use picture rates of 50 or 60 Hz, movie films are recorded at 24, 25 or 30 Hz, while the picture rate of a PC monitor generally lies between 60 and 120 Hz. Thus, to interface broadcast video with a PC display, image frames have to be interpolated at time instances where the original sequence has not been sampled. At present, frame repetition is used for video on PCs, while a variety of different algorithms have been developed to convert between formats in the TV environment. The first algorithms developed were purely static, i.e. they did not take the motion in the scene into consideration. With these techniques, an annoying motion jerkiness or blurring in moving areas is likely to appear [2]. In order to achieve a high quality for video on PCs, comparable with that of a TV receiver, advanced motion compensated interpolation techniques are necessary [3]. Frame rate up-conversion can be realised with dedicated hardware or with a programmable device. We will present an advanced motion compensated algorithm running in real time on a currently available programmable device. This paper is organised as follows: section 2 introduces the problem and
section 3 provides the details on the proposed algorithm. We evaluate the proposal in section 4, and we draw our conclusion in section 5.
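The required temporal interpolation positions follow directly from the ratio of the two picture rates. As a minimal sketch (ours, not from the paper; the function name `interpolation_schedule` is hypothetical), each output frame j of a conversion from rate r_in to rate r_out maps to an input frame index n and an interpolation phase alpha:

```python
# Sketch (not from the paper): mapping output frames of a picture-rate
# up-conversion to input frame pairs and interpolation phases.
# Output frame j corresponds to input time t = j * r_in / r_out (in
# input-frame units); it is interpolated between input frames n = floor(t)
# and n + 1 at phase alpha = t - n (alpha = 0: an original frame is reused).

def interpolation_schedule(r_in, r_out, num_output_frames):
    schedule = []
    for j in range(num_output_frames):
        t = j * r_in / r_out
        n = int(t)
        schedule.append((n, t - n))
    return schedule

# 50 Hz -> 75 Hz: every second output frame falls between two input frames,
# at phases of approximately 2/3 and 1/3.
print(interpolation_schedule(50, 75, 4))
```

A phase of alpha = 0 means the original frame can simply be displayed; all other phases require interpolation, which is where the motion compensated techniques discussed in this paper come in.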
2. The problem statement
Repetition of a picture until a new one arrives is the common way to realise picture rate up-conversion in a PC environment, but it results in discontinuities in the observed motion of objects. In general, it leads to artifacts varying from "motion judder" (if the difference between input and output picture rate is below approximately 30 Hz) to "motion blur" (for higher difference frequencies). The blurring effect can be understood as follows: the viewer tracks the object in the successive images, expecting a regular sequence of positions. When an irregular motion of the object appears, our visual system concludes that there must be two objects moving in parallel (fig.1). To convey a subjective impression of such an image we included fig.2. Linear processing, e.g. temporal averaging of two successive frames, will not bring a perfect interpolation for moving objects either: in the interpolated frame, the objects appear in both the previous and the following positions, but not in the one which is correct for the time instance of the up-converted picture. Using this technique, three objects become visible (fig.3). A good motion portrayal can instead be achieved with a motion compensated (MC) algorithm, which interpolates the object at the correct position along the motion trajectory (fig.4). A motion compensated interpolation is generally done by averaging neighbouring fields after compensating for motion. In order to show the tremendous improvement in motion portrayal achieved in this way, we would need moving images. As an attempt to show it on paper, fig.5 presents a comparison between a motion compensated image and an image resulting from temporal averaging, in terms of Mean Squared Error (MSE). This measure is insufficient to completely judge image quality, but, evidently,
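The difference between the three options can be illustrated with a small 1-D numpy sketch (our own illustration, not the authors' code): a bright pixel moving 4 pixels per frame, up-converted halfway in time between two frames.

```python
import numpy as np

# 1-D illustration (our own sketch) of the three up-conversion options:
# an "object" (a bright pixel) moves 4 pixels per frame.
W = 16
def frame(pos):
    f = np.zeros(W); f[pos] = 1.0; return f

prev, nxt = frame(4), frame(8)   # object at x=4, then at x=8
alpha = 0.5                      # interpolate halfway in time
d = 4                            # true displacement (assumed known)

repeated = prev                   # frame repetition: object stays at x=4
averaged = 0.5 * (prev + nxt)     # temporal average: ghosts at x=4 AND x=8
mc = 0.5 * (np.roll(prev, int(alpha * d))           # fetch from previous frame
          + np.roll(nxt, -int((1 - alpha) * d)))    # and from next frame

print(np.nonzero(repeated)[0])  # [4]
print(np.nonzero(averaged)[0])  # [4 8]
print(np.nonzero(mc)[0])        # [6]
```

Only the motion compensated result places a single object at the position that is correct for the intermediate time instance.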
Figure 1. Frame-rate conversion from 60 Hz to 120 Hz: (a) input stream showing a moving ball, (b) replicating each image until a new one is available, (c) the appearance to the viewer, who expects the ball at positions different from those in which it appears, resulting in the observation of a double ball.

Figure 2. On the left side: the subjective impression is blurred when frame-rate conversion uses frame replication and the picture rate difference is higher than approximately 30 Hz. On the right side: motion compensated frame-rate conversion results in temporally smoother and spatially sharper perceived images.
the difference in MSE between the non-MC and the MC options is huge. Clearly, we are not dealing with a minor improvement but with a very significant effect. In order to use an MC algorithm, an estimate of the motion in the scene is required. In this paper we assume the availability of true motion vectors; the feasibility of real-time motion estimation on a programmable device is shown in [3] and [10]. We further assume that the incoming field is de-interlaced with one of the available algorithms; for a thorough evaluation of de-interlacing algorithms we refer to [1], and for de-interlacing on a programmable device to [9]. Independently of the quality of the motion estimation, every MC conversion is confronted with the problem of occlusion. In pictures resulting from MC field rate converters, an artifact is generally visible at the boundary of moving objects where either covering or uncovering of background occurs. This is due to the fact that, when using two frames for the motion compensation process, it is assumed that the needed information is present in both frames. In fact, if we want to estimate the motion between two successive frames and/or to interpolate an image between a previous image and the next one, for covered
Figure 3. Using temporal averaging to interpolate missing images results in the observation of three objects moving in parallel.
areas the useful information is present only in the previous frame, and for uncovered areas it is available only in the next frame. Consequently, the motion estimator often provides unreliable vectors for these areas, and applying these vectors in MC interpolation results in visible, and often quite annoying, artifacts. In [4] a method was proposed for MC picture interpolation that reduces the negative effect of covering and uncovering, using an order-statistical filter in the up-conversion to replace the MC averaging. This algorithm has been implemented in a commercially available IC for motion compensated television scan rate conversion [6]. The current algorithm is an elaboration of this original idea, and basically adapts the interpolation strategy using a segmentation of the image into less and more complex areas. The resulting processing is not only better in terms of image quality, but at the same time the computational complexity is reduced, allowing real-time processing on a programmable device.
Figure 4. Motion compensated frame-rate conversion: the interpolated position of the object is deduced from the adjacent images; a new picture is calculated at the intermediate time instance using motion compensation techniques.
Figure 5. The difference in Mean Squared Error of images interpolated with and without motion estimation is huge. This graph shows that this is true both for (high priced) professional scan rate converters and for converters applying a consumer priced estimator.
3. The proposed algorithm
In order to detect areas in which covering or uncovering occurs, the current algorithm needs only the information available in the motion vector field related to that frame, and very limited processing of that information. The first step of the algorithm consists in detecting significant discontinuities in the given motion vector field, assuming that these correspond to the borders of moving objects. More formally, let $\vec{D}(\vec{X}, n)$ be the displacement vector assigned to the centre $\vec{X} = (X_x, X_y)^T$, where $T$ indicates the transpose, of a block of pixels $B(\vec{X})$ in the current field $n$. In order to detect horizontal edges, we check the difference of the displacement vectors $\vec{D}_l(\vec{X} - \vec{K}, n)$ and $\vec{D}_r(\vec{X} + \vec{K}, n)$, where $\vec{K} = (k, 0)^T$ with $k$ a constant, and $\vec{D}_l$ and $\vec{D}_r$ are the motion vectors assigned to the blocks situated on the left and on the right hand side, respectively, of every block $B(\vec{X})$ in the current field $n$. In order to detect vertical edges, we take into consideration the vectors assigned to the blocks located above and below the current block, $\vec{D}_t(\vec{X} - \vec{C}, n)$ and $\vec{D}_b(\vec{X} + \vec{C}, n)$, where $\vec{C} = (0, k)^T$. When the absolute difference of either the x or the y components is higher than a threshold value:
$$\left|D_{l,x}(\vec{X} - \vec{K}, n) - D_{r,x}(\vec{X} + \vec{K}, n)\right| > thre \quad (1)$$
$$\left|D_{l,y}(\vec{X} - \vec{K}, n) - D_{r,y}(\vec{X} + \vec{K}, n)\right| > thre \quad (2)$$
$$\left|D_{t,x}(\vec{X} - \vec{C}, n) - D_{b,x}(\vec{X} + \vec{C}, n)\right| > thre \quad (3)$$
$$\left|D_{t,y}(\vec{X} - \vec{C}, n) - D_{b,y}(\vec{X} + \vec{C}, n)\right| > thre \quad (4)$$
we decide that there is a significant edge within the block centred in $\vec{X} = (X_x, X_y)^T$. Once we have a clear classification of the areas in the interpolated frame into two distinct categories, i.e. present in both frames versus covered or uncovered, we can design an ad hoc interpolation strategy.
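A minimal sketch of this detector (ours; the function name, block layout and threshold value are assumptions) over a block-based vector field:

```python
import numpy as np

# Sketch (our illustration) of the covering/uncovering detector of
# eqs. (1)-(4): a block is flagged when the x or y component of the motion
# vectors of its left/right (or top/bottom) neighbours at block distance k
# differ by more than a threshold.

def detect_edges(D, k=1, thre=3):
    """D: (rows, cols, 2) block-based motion vector field (vx, vy)."""
    rows, cols, _ = D.shape
    edge = np.zeros((rows, cols), dtype=bool)
    for y in range(rows):
        for x in range(cols):
            xl, xr = max(x - k, 0), min(x + k, cols - 1)  # left/right blocks
            yt, yb = max(y - k, 0), min(y + k, rows - 1)  # top/bottom blocks
            horiz = np.abs(D[y, xl] - D[y, xr]) > thre    # eqs. (1)-(2)
            vert = np.abs(D[yt, x] - D[yb, x]) > thre     # eqs. (3)-(4)
            edge[y, x] = horiz.any() or vert.any()
    return edge

# A moving object (vector (8, 0)) on a static background:
D = np.zeros((6, 6, 2)); D[2:4, 2:4, 0] = 8
print(detect_edges(D).astype(int))
```

The flagged blocks form a band around the object boundary, which is exactly where covering or uncovering of the background can occur.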
3.1. The interpolation strategies
In this section, two of the most popular algorithms for interpolating frames in between existing ones are presented (for a more detailed analysis see also [8]), together with the one we propose in order to exploit the acquired information on where 'difficult areas' are. A prior-art method yields an interpolated picture, using a motion compensated average, according to:

$$F(\vec{x}, n+\alpha) = \frac{1}{2}\left[F(\vec{x} - \alpha\vec{D}(\vec{x}, n), n) + F(\vec{x} + (1-\alpha)\vec{D}(\vec{x}, n), n+1)\right] \quad (5)$$

where $F(\vec{x}, n)$ is the luminance value of the pixel at spatial position $\vec{x}$ and temporal position $n$, $n+\alpha$ is the position in time of the up-converted frame between the two existing ones, and $0 < \alpha < 1$. A more robust alternative [4] replaces this MC average with a three-tap median that includes the non-motion-compensated average $F_{av}(\vec{x}) = \frac{1}{2}\left[F(\vec{x}, n) + F(\vec{x}, n+1)\right]$:

$$F(\vec{x}, n+\alpha) = \mathrm{med}\left\{F(\vec{x} - \alpha\vec{D}(\vec{x}, n), n),\; F_{av}(\vec{x}),\; F(\vec{x} + (1-\alpha)\vec{D}(\vec{x}, n), n+1)\right\} \quad (6)$$
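Both strategies can be sketched in a few lines of numpy (our illustration, using a single global vector and integer-pixel accuracy for brevity; the median variant shown uses the non-motion-compensated average as the third tap, one common robust choice):

```python
import numpy as np

# Sketch (ours, not the authors' code) of the two interpolation strategies:
# the MC average of eq. (5), and a three-tap median in the spirit of eq. (6).

def mc_average(prev, nxt, d, alpha):
    # F(x, n+alpha) = 1/2 [F(x - alpha*d, n) + F(x + (1-alpha)*d, n+1)]
    a = np.roll(prev, int(round(alpha * d)))
    b = np.roll(nxt, -int(round((1 - alpha) * d)))
    return 0.5 * (a + b)

def mc_median(prev, nxt, d, alpha):
    # Median of the two MC pixels and the non-MC average; the non-MC
    # average acts as a fallback where the vector is unreliable.
    a = np.roll(prev, int(round(alpha * d)))
    b = np.roll(nxt, -int(round((1 - alpha) * d)))
    av = 0.5 * (prev + nxt)
    return np.median(np.stack([a, av, b]), axis=0)

prev = np.zeros(16); prev[4] = 1.0
nxt = np.zeros(16); nxt[8] = 1.0
print(mc_average(prev, nxt, d=4, alpha=0.5))  # peak at x = 6
print(mc_median(prev, nxt, d=4, alpha=0.5))   # peak at x = 6
```

With a correct vector both outputs coincide; they differ precisely where the vector is wrong, which is why the median is the safer choice in occlusion areas.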
The proposed strategy combines the two, applying the median only where it is needed:

$$F(\vec{x}, n+\alpha) = \begin{cases} \text{the median of eq. (6)}, & \text{in (un)covering areas} \\ \text{the MC average of eq. (5)}, & \text{otherwise} \end{cases} \quad (9)$$

i.e. we propose to use the median approach in covering/uncovering areas and the MC averaging otherwise. This approach provides a generally improved output with respect to both methods described in eq. (5) and eq. (6) (see the next section for an evaluation). Moreover, the operation count is reduced, on average, in comparison with what is required by the median method of eq. (6), since the median filtering is needed only for a portion of the pixels in the frame that is generally not bigger than 10%. This method is particularly interesting for software implementation, e.g. on the Philips Trimedia processor, since it provides the same or better quality than a 'median' approach, with an operation count on average reduced by about one third. We will refer to this method as the averaging and median filtering combination method (AV/MED).

4. Evaluation
Many error measures have been proposed as quality criteria for digital video, but since the human visual system is very complex, up to now the visibility of image distortions cannot be entirely described by one objective criterion. Nevertheless, some measures seem to have a higher correlation than others with the perceived quality, and some are just more popular than others. We have used two quality measures: the MSE and the Subjective Mean Squared Error (SMSE). The MSE is often used as a quality criterion in image processing, although the correlation between MSE and subjective judgements has proven to be not always high. More recently the SMSE has been introduced [7][2]. The SMSE is defined by:

$$SMSE = \left(\frac{1}{NM}\sum_{x=1}^{N}\sum_{y=1}^{M}\left|d(x, y)\right|^{p}\right)^{\frac{1}{p}} \quad (10)$$

where $d(x, y) = \omega(x, y)\, g(F_a(x, y) - F_b(x, y))$, with $F(x, y)$ the luminance value of a frame at position $(x, y)$, $g$ a function that models the visibility of the error, $\omega(x, y)$ a weight related to the informative value of the pixel at position $(x, y)$, and $p$ a factor that determines the relative importance of small and large errors. In this paper we selected:

$$g(x, y) = M_{2 \times 2}(F_{ref}(x, y)) - M_{2 \times 2}(F_{int}(x, y)), \quad \omega(x, y) = 1, \quad p = 3 \quad (11)$$

where $M_{2 \times 2}(F(x, y))$ denotes a linear average in a $2 \times 2$ pixel window, describing the low-pass characteristic of the human eye. In order to evaluate the various interpolation strategies with the above described error measures, two sequences have been chosen: 'Renata' and 'Flowers'. The sequences have been sub-sampled by a factor of three in the temporal domain, which means that frames 1, 4, 7, ... were used as input.
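A sketch of this measure (ours; $M_{2 \times 2}$ is implemented here as a same-size 2x2 box average with edge clamping, which is an assumption about the window handling):

```python
import numpy as np

# Sketch (ours) of the SMSE of eqs. (10)-(11): the error between the
# low-pass filtered reference and interpolated frames is pooled with
# exponent p = 3 and uniform weights (omega = 1).

def m22(F):
    # 2x2 box average, same-size output via edge-clamped padding.
    Fp = np.pad(F, ((0, 1), (0, 1)), mode="edge")
    return 0.25 * (Fp[:-1, :-1] + Fp[1:, :-1] + Fp[:-1, 1:] + Fp[1:, 1:])

def smse(F_ref, F_int, p=3):
    d = m22(F_ref) - m22(F_int)   # g(x, y) of eq. (11)
    N, M = F_ref.shape
    return (np.sum(np.abs(d) ** p) / (N * M)) ** (1.0 / p)

ref = np.zeros((8, 8)); ref[3, 3] = 1.0
interp = np.zeros((8, 8)); interp[3, 4] = 1.0   # object one pixel off
print(smse(ref, interp))
```

The cubic exponent emphasises large localised errors, which matches the observation that isolated strong artifacts are perceptually more disturbing than widespread small deviations.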
Frames 2, 3, 5, 6, ... (i.e. those removed by the sub-sampling) were then interpolated in order to regain the original frame rate. The 'deviations' were then calculated for each pixel in the interpolated frame with respect to the original information, after leaving some boundary at the edges of the frame. This information was averaged over 32 frames (16 fields). The MSE and SMSE scores of the different interpolation strategies for the two test sequences are shown in table 1. We found the SMSE to match the subjective evaluation better than the MSE. An evaluation of the various interpolation strategies in terms of operation count has also been carried out (table 2). The detection of the blocks where covering or uncovering occurs costs approximately 0.3 operations per pixel in a software implementation. If we consider that the peak load occurs on average on 10% of the pixels in a sequence, we can approximately estimate the computational efforts reported in table 3.

Table 1. Error figures on the 'Renata' and 'Flower' sequences for different interpolation strategies

RENATA   | MC AV   | MED     | AV/MED
MSE      | 51.478  | 55.416  | 50.751
SMSE     | 180.743 | 156.633 | 156.73

FLOWER   | MC AV   | MED     | AV/MED
MSE      | 75.974  | 87.054  | 75.457
SMSE     | 4131.01 | 4207.57 | 3906.47

Table 2. Operations per pixel required by the various interpolation strategies in a software implementation

INTERP STRATEGY | PEAK LOAD | AVERAGE LOAD
MC AV           | 1         | 1
MED             | 4         | 4
AV/MED          | 4.3       | 1.3

Table 3. Operations per pixel required on average, considering that the peak load occurs on 10% of the pixels in a sequence

INTERP STRATEGY | OP/PIXEL
MC AV           | 1
MED             | 4
AV/MED          | 1.6

5. Conclusion
In order to achieve high quality video on PCs, motion compensated up-conversion has to be used. The quality of a motion compensated up-conversion is mainly challenged at moving edges in covering/uncovering areas, where the straightforward solution, a motion compensated average, can produce strong and annoying artifacts. The algorithm described in this paper first localises, in a robust and cost-effective way, the covering/uncovering areas. In order to reduce artifacts in those areas, an up-conversion strategy is chosen that provides a quality comparable to that of state-of-the-art methods for high quality 100 Hz TV receivers, while gaining, on average, a factor of 3 in number of operations compared to them. This makes the method particularly interesting for software implementation, e.g. on the Philips Trimedia processor.
References
[1] E. Bellers and G. de Haan. De-interlacing - an overview. In Proc. of the IEEE, volume 86/9, pages 1839-1857, September 1998.
[2] H. Blume and H. Schröder. Image format conversion - algorithms, architectures, applications. In Proc. of the ProRISC/IEEE Workshop on Circuits, Systems and Signal Processing, pages 19-37, Mierlo, NL, November 1996.
[3] G. de Haan. Judder-free video on PCs. In Proc. of the WinHEC98, March 1998.
[4] G. de Haan, P. Biezen, H. Huijgen, and O. Ojo. Graceful degradation in motion compensated field-rate conversion. In Signal Processing of HDTV, V, L. Stenger, L. Chiariglione and M. Akgun (Eds.), Elsevier, pages 249-256, Ottawa, Canada, October 1993.
[5] G. de Haan, P. Biezen, H. Huijgen, and O. Ojo. An evolutionary architecture for motion-compensated 100 Hz television. In IEEE Trans. on Circuits and Systems for Video Technology, volume 3/5, pages 368-379, October 1995.
[6] G. de Haan, J. Kettenis, and B. De Loore. IC for motion compensated 100 Hz TV with a smooth-motion movie-mode. In IEEE Trans. on Consumer Electronics, volume 42, pages 165-174, May 1996.
[7] H. Marmolin. Subjective MSE measures. In IEEE Trans. on Systems, Man and Cybernetics, volume 16/3, pages 486-489, May/June 1986.
[8] O. Ojo and G. de Haan. Robust motion-compensated video upconversion. In IEEE Trans. on Consumer Electronics, volume 43, pages 1045-1057, August 1997.
[9] A. Riemens, R. Schutten, and K. Vissers. High speed video de-interlacing with a programmable Trimedia VLIW core. In Proc. of the ICSPAT '97, pages 1375-1380, San Diego, September 1997.
[10] R. Schutten and G. de Haan. Real-time 2-3 pull-down elimination applying motion estimation/compensation in a programmable device. In IEEE Trans. on Consumer Electronics, volume 44/3, pages 930-938, August 1998.