Camera-Based System for Tracking and Position Estimation of Humans

Robert Hartmann, Fadi Al Machot, Philipp Mahr, and Christophe Bobda
University of Potsdam, Department of Computer Science
August-Bebel-Str. 89, D-14482 Potsdam, Germany
E-mail: {hartmann, almachot, pmahr, bobda}@cs.uni-potsdam.de

Abstract

The human population is getting older, as current studies show. Because elderly people are at a higher risk of in-house accidents, there is an increasing need for ambient assisted living systems. These systems should detect accidents or dangerous situations in order to improve the quality of life for these people. The goal of this work is to build a robust and intelligent system which estimates the position of humans using only one camera. The position is used to detect falls and to allow an immediate call for help. The solution is based on a foreground-background segmentation using Gaussian Mixture Models to first detect people and then analyze their main and ideal orientation using moments. This allows the system to decide whether a person is standing or lying on the floor. The system has a low latency and a detection rate of 88% in our case study.

Index Terms—Cameras, Image analysis, Image segmentation, Optical position measurement, Optical tracking

I. Introduction

Each year in England a third of the population aged over 65 has a fall, and half of these persons fall at least twice. Women are at a greater risk than men, with half of all women aged over 85 having a fall in any one year [1]. It is very important to find a technical solution that detects falls of elderly people as soon as possible in order to call for help. When, for instance, a heart attack is the reason for the fall, every second counts to save the person's life. The system must therefore have a low latency, so that an alarm is sent as soon as possible. The aim of this work is to develop an algorithm that processes video streams to detect people and analyze their position in space, especially recognizing people who are lying on the floor.


The problem of object detection and evaluation is well established in computer science. In general, model-based and training-based approaches are used. Viola and Jones [2] presented an AdaBoost-based detector in 2001, where a boosted cascade of classifiers built on Haar-like features is used; integral images were used to speed up the process. In [3] Ashbrook et al. used pairwise geometric histograms (PGH) to detect objects even with occlusion or scene clutter present. Rotation-invariant histograms are built from geometric parameters of line segments. The PGHs calculated from training images are later compared against the PGHs from the scene image using the Bhattacharyya metric. Another way to solve the problem is the use of optical flow [4]. The flow of a walking person is directed uniformly for the whole body. If the person falls, the optical flow points in several directions. This fact could be used to detect falls as abnormal behavior of the optical flow. However, model-based and training-based approaches require a lot of training, and optical flow approaches require a lot of computational power. A model-free and computationally inexpensive approach is needed in order to obtain a simple and small solution which can be integrated into our FPGA-based smart camera [5] without the need for a bigger FPGA. Therefore a model-free approach is used. The use of only one camera leads to greater acceptance for in-house usage due to the minor need for technical devices, which we think is important in assisted living environments.

The overall structure of the paper is organized as follows. In section II we give a brief overview of existing technologies and projects that also address our problem. In section III we describe our solution in detail and point out the theoretical background. Section IV describes and evaluates our implementation. Finally, we conclude the paper in section V and give an outlook on future work.


II. Related Work

The Institute for Robotics of the University of Braunschweig, Germany is developing a system that will ensure a long, independent and secure life for elderly and handicapped people in their home environment. An initial supervision system is already operational and currently undergoing tests. The image processing system is able to automatically detect people using a camera and to decide whether an ambulance should be informed or not. The institute has developed several model-free and model-based image processing algorithms that enable the tracking of people and the detection of falls in a room [6]. The system has also been extended for active supervision approaches that allow the identification of individuals who fall during the night, i.e. in the dark. For this purpose, infra-red emitters are attached to various positions at the ceiling. Whenever these lamps turn on and off, the supervised person casts shadows in different directions. The shadow information is used to distinguish between people standing and people lying on the floor [7]. However, the system only works in covered areas provided with a large number of such devices. This requirement is hard to meet in most living environments of elderly people. Also, the segmentation approach is not adaptive to small background movements as they normally occur indoors.

A team at the University of Massachusetts uses a network of overlapping smart cameras in a decentralized procedure. They can compute inter-image homographies that allow the location of a fall to be detected in 2D world coordinates by calibrating the respective data. The aim of their work is to build a system which implements a fall detection procedure. They succeeded by using a more sophisticated support vector machine (SVM) classifier; 40 image sets were used to train the system in order to build a classifier which decides whether an event is regarded as a fall or not. The system requires the manual calibration of one camera per group of overlapping cameras. The localization error of the system is less than 50 cm. The system comprises very low-power, low-resolution cameras and motes that can be used to detect and localize falls [8]. As mentioned before, however, we try to build a model-free system without the need for training data and manual calibration. Furthermore, the use of several cameras per room is an exclusion criterion, because this technical requirement cannot be guaranteed in most living environments.

III. The System

To explain our approach to the problem, the system is first described in general. Humans are monitored through a fish eye camera which is situated in the middle of the ceiling. The special perspective of this kind of lens gives an easy parameter to detect whether people are standing or lying, because the main axis of a standing person points to the middle of the image. This property is the basic principle of the algorithm. When the main axis of a segmented person differs from its expected direction towards the middle of the image, we classify the person as lying. Falls are detected as moving from an upright position into a lying one within a small time slot. The steps are shown in figure 1.

Fig. 1. The working flow of the system. a) Capture the current image of the room. b) Segment the moving object from the static background using an adaptive GMM. c) Calculate the ideal orientation φ as it should be for an upright person. d) Calculate the main orientation θ of the person. e) Use φ and θ to decide whether the person is upright or lying on the floor.

Now we will introduce the system in detail by looking at each step individually.

A. Image capturing

The most important part of the image capturing process is the fish eye view of the camera. It has several advantages. First, it can cover big rooms due to its wide angle view; it captures 180° × 160°. Second, and more important for the algorithm, the main axis of an upright standing person in the image is always pointing to the middle of the image, as can be seen in figure 2. This second property only holds if the camera lens is mounted parallel to the room's floor, i.e. the image capturing direction is perpendicular to the floor. In normal rectangular rooms this is always given if we mount the camera at the ceiling. A third advantage is that the camera automatically deskews the image after capturing, which simplifies processing.

Fig. 2. Example image view of the fish eye camera (without deskew) [9]

B. Segmentation

The second part of the algorithm is the foreground-background segmentation to obtain the mask of the person.
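Before going into the details of the model, the following minimal sketch shows how such a foreground mask can be obtained with OpenCV's MOG2 background subtractor, which implements an adaptive per-pixel Gaussian mixture. The video source, parameter values and post-processing are illustrative assumptions, not the implementation used in this paper.

```cpp
// Sketch: obtaining a foreground mask with an adaptive Gaussian mixture
// background model (OpenCV's MOG2). Video source and parameter values
// are illustrative, not those of the paper.
#include <opencv2/opencv.hpp>

int main() {
    cv::VideoCapture cap(0);                     // any camera or video file
    // history = 500 frames, variance threshold = 16, shadow detection on
    auto mog2 = cv::createBackgroundSubtractorMOG2(500, 16.0, true);

    cv::Mat frame, fgMask;
    while (cap.read(frame)) {
        mog2->apply(frame, fgMask, -1);          // -1: automatic learning rate
        // MOG2 marks shadow pixels with 127; keep only "sure" foreground
        cv::threshold(fgMask, fgMask, 200, 255, cv::THRESH_BINARY);
        cv::medianBlur(fgMask, fgMask, 5);       // remove single-pixel noise
        cv::imshow("foreground mask", fgMask);
        if (cv::waitKey(1) == 27) break;         // ESC quits
    }
    return 0;
}
```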


In order to locate moving foreground objects in videos, a segmentation technique called Gaussian Mixture Model (GMM) is used. GMM is an important tool in image data analysis. Segmenting the image means dividing the video frame into different classes or regions, namely background and foreground. We can therefore suppose that each pixel belongs to a Gaussian distribution with its own mean and covariance matrix. The parameters of the model are learned by the Expectation-Maximization algorithm. Each mixture component j consists of a Gaussian with mean µ_j and covariance matrix Σ_j, i.e. in the case of a 2D color space:

p(ξ|j) = 1 / (2π |Σ_j|^0.5) · exp(−0.5 (ξ − µ_j)^T Σ_j^−1 (ξ − µ_j))   (1)

So, with a Gaussian background mixture model each pixel is modeled as the sum of k weighted Gaussians. The weights reflect how frequently a Gaussian has been identified as part of the background model; they are updated adaptively with a learning rate α and the new observation.

The idea of Maximum Likelihood Estimation (MLE) is to assume a specific model with unknown parameters and to define the probability of observing a given event conditional on a specific set of parameters. Once a set of outcomes has been observed in the real world, it is possible to choose the set of parameters which is most likely to have produced these results. If the likelihood function is given by a distribution f(x_1, x_2, ..., x_n; Θ), where Θ denotes the unknown parameters and the data x_i are normally distributed (Gaussian) and independent, we can estimate the parameters Θ from

L(X|Θ) = ∏_{i=1..n} f(x_i; Θ)   (2)

Now we should maximize the likelihood function by solving ∂L/∂Θ = 0, but this would be very difficult. Therefore we use an iterative algorithm called Expectation-Maximization (EM), which maximizes the likelihood of fitting a mixture model to a set of training data. It is very important for this algorithm that the a priori selection of the model order, i.e. the number of components k, is incorporated into the model. Mixture models do not work well in the case of constant repetitive motion and high contrast between pixel values (edge regions), because the object then appears in both the foreground and the background [10].
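The Expectation-Maximization step can be made concrete with OpenCV's cv::ml::EM class, which fits a k-component Gaussian mixture to sample vectors by maximizing the likelihood in equation (2). The synthetic 2D samples and the choice k = 3 below are arbitrary illustrative values, not taken from the paper.

```cpp
// Sketch: fitting a Gaussian mixture with Expectation-Maximization,
// roughly corresponding to equations (1) and (2). Sample data and the
// number of components k are arbitrary illustrative choices.
#include <opencv2/opencv.hpp>
#include <opencv2/ml.hpp>
#include <iostream>

int main() {
    // 300 synthetic 2D samples (e.g. pixel values in a 2D color space)
    cv::Mat samples(300, 2, CV_32FC1);
    cv::RNG rng;
    rng.fill(samples.rowRange(0, 100),   cv::RNG::NORMAL, cv::Scalar(50),  cv::Scalar(10));
    rng.fill(samples.rowRange(100, 200), cv::RNG::NORMAL, cv::Scalar(128), cv::Scalar(15));
    rng.fill(samples.rowRange(200, 300), cv::RNG::NORMAL, cv::Scalar(200), cv::Scalar(10));

    const int k = 3;                               // model order chosen a priori
    cv::Ptr<cv::ml::EM> em = cv::ml::EM::create();
    em->setClustersNumber(k);
    em->setCovarianceMatrixType(cv::ml::EM::COV_MAT_GENERIC);

    cv::Mat logLikelihoods, labels;
    em->trainEM(samples, logLikelihoods, labels, cv::noArray());

    std::cout << "weights: " << em->getWeights() << std::endl;
    std::cout << "means:   " << em->getMeans()   << std::endl;
    return 0;
}
```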

C. Calculating the orientation

After segmentation of the image into foreground and background we have to determine the two main angles φ and θ. θ is the main orientation (main axis) of the detected object, see figure 3. φ is the expected orientation of an upright standing person with the same center of mass as the detected person, see figure 4.

Fig. 3. Main orientation θ of an abstract object (ellipse) within an image using a fish eye camera

Fig. 4. The calculation of the ideal orientation φ


The two angles are calculated using moments. Moments are certain weighted averages of the pixel values of an image. They are usually chosen so that they reflect desired characteristics of the image, and they are useful to describe individual objects in a segmented image. The moments of the gray value function I_B(x, y) of an image are defined as follows, where p and q are the order of the moment [11]:

µ_{p,q} := ∫∫_{X×Y} x^p y^q I_B(x, y) dx dy

The integration is calculated over the area X × Y of the image I_B. When we have segmented an object, we get a binary image mask M whose pixels contain the value one (where the object is) or zero (elsewhere). Then we can also integrate over the object's area only:

µ_{p,q} := ∫∫_{X×Y} x^p y^q I_B(x, y) M(x, y) dx dy

The zero order moment µ_{0,0} is the sum of the pixel values of an image. The first order moments µ_{1,0} and µ_{0,1} describe the horizontal and vertical center of mass of an object:

µ_{0,0} := ∑_x ∑_y I_B(x, y)

µ_{1,0} := ( ∑_x ∑_y x I_B(x, y) ) / µ_{0,0}

µ_{0,1} := ( ∑_x ∑_y y I_B(x, y) ) / µ_{0,0}
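As a small illustration of these sums, the following helper computes µ_{0,0}, µ_{1,0} and µ_{0,1} directly from a binary object mask. Treating the mask itself as I_B(x, y)·M(x, y) is a simplification made here for brevity; in practice cv::moments returns the same quantities.

```cpp
// Sketch: zero and first order moments of a binary object mask,
// evaluated exactly as in the sums above (the mask plays the role of
// I_B(x, y) * M(x, y), i.e. nonzero on the object and zero elsewhere).
#include <opencv2/opencv.hpp>
#include <iostream>

// Returns the centroid (mu_{1,0}, mu_{0,1}) of the object in 'mask'.
cv::Point2d centerOfMass(const cv::Mat& mask) {
    CV_Assert(mask.type() == CV_8UC1);
    double m00 = 0.0, m10 = 0.0, m01 = 0.0;
    for (int y = 0; y < mask.rows; ++y) {
        const uchar* row = mask.ptr<uchar>(y);
        for (int x = 0; x < mask.cols; ++x) {
            if (row[x] == 0) continue;       // outside the object
            m00 += 1.0;                      // mu_{0,0}: object area in pixels
            m10 += x;                        // numerator of mu_{1,0}
            m01 += y;                        // numerator of mu_{0,1}
        }
    }
    if (m00 == 0.0) return cv::Point2d(-1, -1);  // no object found
    return cv::Point2d(m10 / m00, m01 / m00);
}

int main() {
    cv::Mat mask = cv::Mat::zeros(100, 100, CV_8UC1);
    cv::rectangle(mask, cv::Rect(20, 30, 10, 40), cv::Scalar(255), cv::FILLED);
    std::cout << "center of mass: " << centerOfMass(mask) << std::endl;  // ~ (24.5, 49.5)
    return 0;
}
```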

The first order moments are later used to calculate the ideal orientation φ of a person, relative to its center of mass. The second order moments weight the pixel values with the squared or mixed pixel coordinates. The normalized second order moments are associated with the orientation of the object and are used for the main orientation θ:

µ_{2,0} := ∑_x ∑_y x² I_B(x, y)

µ_{0,2} := ∑_x ∑_y y² I_B(x, y)

µ_{1,1} := ∑_x ∑_y x y I_B(x, y)

The orientation of the object is defined as the tilt angle between the positive x-axis and the axis around which the object can be rotated with minimal inertia [12]:

θ := (1/2) · tan⁻¹( 2 µ_{1,1} / (µ_{2,0} − µ_{0,2}) )

D. Fall decision

Now that the orientation of a person is calculated, we have all relevant data to make the final position estimation of a person and to decide whether there was a fall or not. The main decision parameter is the deviation between the ideal orientation φ and the main orientation θ of the segmented object. A fall is detected in one frame if

|φ − θ| ≥ ε,

see also figure 5. The specific threshold ε was determined empirically, i.e. the angle of a falling person has been observed several times.

Fig. 5. Example view of a fall with a large deviation between the main orientation θ (black) and the ideal orientation φ (green).

To ensure the robustness of the approach and to avoid false positive detections due to a one-frame error, we also add a frame counter. If we detect a person lying on the floor in one frame we increment a counter c. If we do not detect it we reset the counter, c = 0. If c increases beyond a predefined threshold we signal a detected fall. It should be mentioned that we perform the orientation estimation for each detected person separately. Therefore the foreground mask is separated into connected components and the algorithm analyzes each component independently. This way multiple orientations in a room can be evaluated and therefore multiple falls can be detected.
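A possible realization of this per-component analysis is sketched below using cv::connectedComponentsWithStats. Approximating the ideal orientation φ by the direction from the image center to the component's centroid, the blob-size filter and the value of ε are assumptions made here for illustration; θ is computed from the second order central moments as described above.

```cpp
// Sketch: evaluate the |phi - theta| >= epsilon test for every connected
// foreground component, so several persons in one frame are handled
// independently. Blob-size filter and epsilon are illustrative values.
#include <opencv2/opencv.hpp>
#include <algorithm>
#include <cmath>
#include <vector>

std::vector<bool> lyingPerComponent(const cv::Mat& fgMask, cv::Point2d imageCenter,
                                    double epsilon) {
    cv::Mat labels, stats, centroids;
    int n = cv::connectedComponentsWithStats(fgMask, labels, stats, centroids);

    std::vector<bool> lying;
    for (int i = 1; i < n; ++i) {                          // label 0 is the background
        if (stats.at<int>(i, cv::CC_STAT_AREA) < 500) continue;  // ignore small blobs
        cv::Mat component = (labels == i);                 // binary mask of this person only
        cv::Moments m = cv::moments(component, true);
        cv::Point2d c(m.m10 / m.m00, m.m01 / m.m00);       // center of mass
        // phi: assumed here as the direction from the image center to the centroid
        double phi   = std::atan2(c.y - imageCenter.y, c.x - imageCenter.x);
        // theta from second order central moments (moments about the centroid)
        double theta = 0.5 * std::atan2(2.0 * m.mu11, m.mu20 - m.mu02);
        double dev   = std::fabs(phi - theta);
        while (dev > CV_PI) dev -= CV_PI;                  // theta is an axis, defined mod pi
        dev = std::min(dev, CV_PI - dev);
        lying.push_back(dev >= epsilon);                   // large deviation -> not upright
    }
    return lying;
}

// Usage: feed the per-frame results into the frame counter described above, e.g.
//   lyingPerComponent(fgMask, cv::Point2d(w / 2.0, h / 2.0), 30.0 * CV_PI / 180.0);
```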

E. Shadow detection

One of the most frequent problems of computer vision systems deployed in environments suffused with light are shadows. Especially the background subtraction step is affected by shadows, since shadows are detected as part of the element in motion.


Despite using a Gaussian mixture model to segment the objects, the shadows still posed a problem for the performance, so an algorithm is needed to detect the shadows and remove them from the images. Cucchiara et al. [13] propose the use of the three parameters of the HSV color system (hue, saturation, value/brightness) to detect shadows. The HSV color space corresponds closely to the human perception of color, and the color information improves the discrimination between shadow and object. Intuitively, when a shadow falls on a background texture, the hue value (color) stays almost the same, the saturation value decreases, and the brightness value is reduced, because shadows make textures look darker than normal. So for each foreground pixel its HSV values are compared to the HSV values of the pixel at the same position in the learned background image:

α ≤ V_i(x, y) / V_B(x, y) ≤ β

S_i(x, y) − S_B(x, y) ≤ τ_s

|H_i(x, y) − H_B(x, y)| ≤ τ_h

where i is the current frame and B is the background frame. If a foreground pixel fulfills all the conditions it is removed from the foreground mask. After removal of the shadows the performance of the system increased considerably. The results of the shadow detection are good; the quality of the overall algorithm is discussed in section IV. The parameters of the shadow detection are α = 0.2, β = 0.95, τ_s = 5 and τ_h = 15.
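A per-pixel version of this test might look as follows. The function assumes 8-bit OpenCV HSV images as produced by cv::cvtColor with COLOR_BGR2HSV (H in [0, 179], S and V in [0, 255]); interpreting τ_s and τ_h on these ranges is an assumption, since the paper does not state the scale explicitly.

```cpp
// Sketch of the HSV shadow test from Cucchiara et al. for a single
// foreground pixel. Assumes 8-bit OpenCV HSV values; the default
// parameters are the ones quoted above.
#include <opencv2/opencv.hpp>
#include <algorithm>
#include <cstdlib>

bool isShadowPixel(const cv::Vec3b& hsvFrame, const cv::Vec3b& hsvBackground,
                   double alpha = 0.2, double beta = 0.95,
                   int tauS = 5, int tauH = 15) {
    double vRatio = hsvBackground[2] > 0
                        ? static_cast<double>(hsvFrame[2]) / hsvBackground[2]
                        : 1.0;
    int dH = std::abs(hsvFrame[0] - hsvBackground[0]);
    dH = std::min(dH, 180 - dH);                  // hue is circular (0..179)
    int dS = hsvFrame[1] - hsvBackground[1];      // shadows lower the saturation

    return (alpha <= vRatio && vRatio <= beta) && // brightness reduced, but not too much
           (dS <= tauS) &&
           (dH <= tauH);
}

// Usage: convert the frame and the learned background to HSV with
// cv::cvtColor(img, hsv, cv::COLOR_BGR2HSV) and call isShadowPixel for every
// foreground pixel; pixels classified as shadow are removed from the mask.
```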

IV. Implementation

The system was implemented using C++ and OpenCV¹. OpenCV is an open source computer vision library originally developed by Intel. It contains more than 500 computer vision functions for medical imaging, security, stereo vision, camera calibration, user interfaces and robotics. It also contains a full, general purpose Machine Learning Library (MLL). All tests were performed on an Intel Core 2 Duo CPU with 2.00 GHz, 2048 KByte of cache and 2 GByte of RAM. A network camera from Mobotix, model Q24, was used, with 1280 × 960 pixels, 30 images/s, an alarm signal controller, multiple screens and a fish eye view. The lens reaches a 180° × 160° field of view with a small focal length of 1.8 mm. With a normal ceiling height of 3 m the camera can monitor rooms up to 55 m wide.

¹ http://opencv.willowgarage.com/
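Grabbing frames from such a network camera with OpenCV could look like the following fragment; the URL is a hypothetical placeholder for the camera's MJPEG stream, not the address used in the actual test setup.

```cpp
// Sketch: opening a network camera stream with OpenCV's VideoCapture
// (FFmpeg backend). The stream URL is a hypothetical placeholder.
#include <opencv2/opencv.hpp>
#include <iostream>

int main() {
    cv::VideoCapture cap("http://camera.local/mjpeg");   // hypothetical MJPEG URL
    if (!cap.isOpened()) {
        std::cerr << "could not open camera stream" << std::endl;
        return 1;
    }
    cv::Mat frame;
    if (cap.read(frame))                                  // frames arrive at 1280 x 960
        std::cout << frame.cols << " x " << frame.rows << std::endl;
    return 0;
}
```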

A. Discussion

After finishing the implementation of the algorithm, the system was tested under several conditions. The test consists of nine positions within a room sized 6 × 4 meters. For each position there were five iterations, so the number of tests was 45. The test scenario included one person who fell to the floor of our test room, observed by the camera; sometimes the person fell with motion and sometimes without motion. We take the fall of a person as the positive event that we need to detect and the remaining time as the negative event that we should not detect as a fall. The tests took place in the afternoon, with the light of the room coming from the side. This produced far bigger shadows, which made the segmentation more complicated. The images in figure 6 are examples of the test. In some tests we noticed that the detection still worked correctly even if the body was partly covered by a chair or a table. The average time of reaction to a fall was one second. That time was used to double-check the decision in order to remove false positive events.

Fig. 6. The test of the system

To measure the performance of the system we make use of sensitivity and specificity. Sensitivity measures the proportion of correctly identified positive events and specificity measures the proportion of correctly identified negative events. 40 of the 45 falls were identified correctly in our tests. Additionally, we analyzed 43 out of 45 of the remaining non-fall periods correctly; only 4 false positives were detected in that time. That is, 88% of the true positives and 91% of the true negatives were detected.

Specificity = TN / (TN + FP) = 43 / (43 + 4) = 91.5%

Sensitivity = TP / (TP + FN) = 40 / (40 + 5) = 88.8%
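For completeness, the two measures follow directly from the raw counts reported above, as the small calculation below shows.

```cpp
// Sketch: sensitivity and specificity from the raw test counts above.
#include <iostream>

int main() {
    const int tp = 40, fn = 5;   // detected / missed falls
    const int tn = 43, fp = 4;   // correct / wrong "no fall" decisions
    double sensitivity = 100.0 * tp / (tp + fn);   // 40/45 ~ 88.8 %
    double specificity = 100.0 * tn / (tn + fp);   // 43/47 ~ 91.5 %
    std::cout << "sensitivity: " << sensitivity << " %\n"
              << "specificity: " << specificity << " %\n";
    return 0;
}
```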

The sensitivity can of course be improved by adjusting some parameters of the algorithm, for instance the threshold on the deviation between main and ideal orientation that is interpreted as a fall. The smaller this value, the more sensitive the algorithm becomes.


However, this goes hand in hand with a big reduction of specificity: a few more true positives would come at the price of many more false positives. This bias is always a trade-off between catching all falls and having as few false alarms as possible. Figure 7 illustrates a weak spot of the system at which it is difficult to exceed the threshold value. The reason is that the ideal orientation, indicating the angle of an upright person, is the same as the real orientation, indicating the angle of the fallen person. A solution to this problem could be to include the dimensions of the foreground mask in the decision step.

Fig. 7. Fall position of a person that is difficult to detect

V. Conclusion

This work presents a solution to estimate the position of people in space, which is used to detect falls. The solution is based on the calculation of the deviation between the main and the ideal orientation of objects segmented from a fish eye camera image, where the main orientation of a standing person points to the middle of the image. Moments are used to calculate the orientations and the deviation between the two angles. The deviation is then compared with a specific threshold to decide if the person has fallen. The system has a low latency and a detection rate of 88%. The high sensitivity and specificity render this system fit for use in assisted living environments. The use of only one fish eye camera per room and a standard PC also makes it easy to integrate into a normal apartment. In addition, the system could be provided with infrared vision to detect falls during the night. In the future we want to integrate the algorithm into our smart camera [5], where the most computation intensive task is the segmentation using GMMs. By realizing this part as hardware in the FPGA and the rest of the algorithm in software running on the PowerPC of the FPGA, the algorithm would speed up a lot. This would give a compact all-in-one solution for in-house assisted living.

References

[1] Mortality figures for accidental falls. Office of National Statistics, 1998.
[2] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, p. 511, 2001.
[3] A. P. Ashbrook, N. A. Thacker, and P. I. Rockett, "Multiple shape recognition using pairwise geometric histogram based algorithms," Fifth International Conference on Image Processing and its Applications, vol. 5, pp. 90-94, 1995.
[4] J.-Y. Bouguet, "Pyramidal implementation of the Lucas-Kanade feature tracker: description of the algorithm," Intel Corporation, Microprocessor Research Labs, Tech. Rep., 2004.
[5] C. Bobda, A. A. Zarezadeh, F. Mühlbauer, R. Hartmann, and K. Cheng, "Reconfigurable architecture for distributed smart cameras," in International Conference on Engineering of Reconfigurable Systems and Algorithms, 2010.
[6] J. Spehr, M. Gövercin, S. Winkelbach, E. Steinhagen-Thiessen, and F. Wahl, "Visual fall detection in home environments," 6th Int. Conference of the Int. Soc. for Gerontechnology, Pisa, Italy, 2008.
[7] J. Spehr, Beaufsichtigung von Personen im häuslichen Umfeld: Grundlagen und Konzepte zur Verwendung einer Fischaugenkamera. Vdm Verlag Dr. Müller, 2007.
[8] A. Williams, D. Ganesan, and A. Hanson, "Aging in place: fall detection and localization in a distributed smart camera network," in MULTIMEDIA '07: Proceedings of the 15th International Conference on Multimedia. New York, NY, USA: ACM, 2007, pp. 892-901.
[9] Q24M Kamerahandbuch. MOBOTIX AG, 2010.
[10] Y. Raja, S. J. McKenna, and S. Gong, "Segmentation and tracking using color mixture models," in ACCV '98: Proceedings of the Third Asian Conference on Computer Vision, Volume I. London, UK: Springer-Verlag, 1997, pp. 607-614.
[11] L. Kotoulas and I. Andreadis, "Image analysis using moments," in 5th Int. Conf. on Technology and Automation, 1998, pp. 360-364.
[12] J. Kilian, "Simple image analysis by moments," OpenCV library documentation, Tech. Rep., 2001.
[13] R. Cucchiara, C. Grana, M. Piccardi, A. Prati, and S. Sirotti, "Improving shadow suppression in moving object detection with HSV color information," in Proceedings of the Fourth International IEEE Conference on Intelligent Transportation Systems, 2001.
