
THE USE OF LEXICAL BASIS FUNCTIONS TO CHARACTERIZE FACES, AND TO MEASURE THEIR PERCEIVED SIMILARITY

John A. Black, Jr., Kanav Kahol, Prem Kuchi, Sethuraman Panchanathan
Visual Computing and Communications Lab, Arizona State University, PO Box 5406, Tempe, AZ 85287-5406, USA

ABSTRACT


Over the last decade researchers have devised algorithms that can provide similarity measures between pairs of face images. These have been somewhat successful in estimating the similarities between face images under controlled conditions. However, those similarity measures do not parallel subjective similarity, as perceived by humans. In some applications it is important to have a similarity metric that closely parallels that of humans. This paper describes a method for discovering the high-level features that humans use to judge facial similarity, through the use of "lexical basis functions" gleaned from a lexicon of the English language. This method estimates the similarity of each pair of images in a set of face images by two independent methods: by the subjective evaluation of human observers, and by the use of "lexical basis functions" to represent the multidimensional content of each image with a feature vector. The similarity measure computed with these feature vectors is shown to correlate with the subjective judgment of human observers, and thus provides both a more objective method for evaluating and expressing image content, and a possible path to automating the process of similarity measurement in the future.

1. INTRODUCTION

Face perception and recognition are active areas of research, with a wide variety of applications, including mug shot searches, identity verification, assistive devices for the blind, and witness face identification. Face recognition algorithms typically rely on some type of similarity metric to compare a reference image to all of the face images in a face database. Before such a metric can be used, some type of characterization (indexing) must first be performed on each of the face images. The resulting characterization is typically represented by a set of coefficient values associated with each image. The coefficient values associated with a reference image can then be compared to the coefficient values associated with each of the face images in the database, to produce a measure of similarity. While the process of comparing these coefficients is straightforward, choosing the method for generating those coefficients is a challenging task, and many different methods have been used.
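As a concrete illustration of this generic pipeline (a minimal sketch, not any specific published algorithm; the feature extractor is left abstract, and all function names here are hypothetical), the coefficient vectors could be compared with, say, cosine similarity:

```python
import numpy as np

def rank_by_similarity(reference_coeffs, database_coeffs):
    """Rank database images by cosine similarity to a reference image.

    reference_coeffs: 1-D coefficient vector for the reference image.
    database_coeffs:  2-D array, one row of coefficients per database image.
    Returns database image indices, most similar first.
    """
    ref = reference_coeffs / np.linalg.norm(reference_coeffs)
    db = database_coeffs / np.linalg.norm(database_coeffs, axis=1, keepdims=True)
    cosine = db @ ref           # one similarity value per database image
    return np.argsort(-cosine)  # descending order of similarity
```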

2. BACKGROUND AND RELATED WORK

Beginning in the 1970s, theories of face processing began appearing in the psychological literature. These theories attempted to explain psychophysical research data on the abilities and limitations of humans in extracting information from views of faces, as well as the sometimes surprising deficits exhibited by some patients. However, as O'Toole et al. [6] observed, these theories were found to be less than adequate to guide researchers who were attempting to build quantitative computational models. In fact, it was the attempt to build such models that made clear the enormous complexities involved in face analysis and recognition. Although much has been learned by neuropsychologists about early visual processing since that time, the high-level processing used to recognize faces is still poorly understood. As a result, in 1995 Chellappa et al. observed that "very little synergism exists between studies in psychophysics and the engineering literature" [5]. They then added that "Barring a few exceptions... research on machine recognition of faces has developed independent of studies in psychophysics and neurophysiology." O'Toole et al. [6] observed that it is difficult to describe faces because of the "elusiveness of a feature list with which they can be described in a precise and globally agreed upon language", despite the fact that face analysis is a routine and vital part of our daily social interactions.

Researchers who tackle the problem of face recognition must deal with the fact that the individual variations in faces that all of us routinely perceive are defined by a very large number of independent (or semi-independent) variables. Thus, a very high-dimensional "face space" is needed to encode the unique combination of variable values that defines each face. Given the limitations of computational machinery, it has been imperative to find a way to map this high-dimensional face space onto a lower-dimensional space. One method for doing this is principal component analysis, which represents a set of correlated variables in terms of a smaller set of non-correlated variables, without losing important or relevant information. This approach was taken by Turk and Pentland [1] in their seminal paper describing the use of eigenfaces to characterize face images. Their method reduced each face to a set of coefficients, each of which was associated with a basis function called an eigenface. This set of coefficients defined a "vector" that characterized that particular face image. If the lighting conditions and the backgrounds were similar, various pictures of the same person's face tended to cluster, and could be represented by an "average" vector that could be used to represent that person. A distance function could then be used to compare each of these vectors to the vector computed from a new image, thus identifying the person in that image.
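A minimal sketch of that eigenface scheme, assuming flattened grayscale face images stacked as the rows of a NumPy array (the function names are ours, not Turk and Pentland's):

```python
import numpy as np

def eigenface_coefficients(face_rows, n_components=20):
    """Project flattened face images onto their principal components.

    face_rows: 2-D array, one flattened grayscale face image per row.
    Returns (coefficients, mean_face, eigenfaces).
    """
    mean_face = face_rows.mean(axis=0)
    centered = face_rows - mean_face
    # SVD of the centered data; the rows of Vt are the eigenfaces
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    eigenfaces = Vt[:n_components]
    coefficients = centered @ eigenfaces.T  # one coefficient vector per image
    return coefficients, mean_face, eigenfaces

# A new image would be identified by projecting it the same way and
# finding the nearest stored coefficient vector (e.g., by Euclidean distance).
```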

A later approach, called the "graph-matching system", and its successor, the "elastic bunch graph-matching system", were described by Wiskott et al. [2]. These methods superimpose a standard set of nodes over the face image at specific facial landmarks (called fiducial points) and then focus intensive processing on those points only. (Fiducial points might be the corners of the eyes, the corners of the mouth, etc.) The processing is done with Gabor wavelets of various orientations and spatial frequencies, which is reminiscent of the processing done by hypercolumns in the human primary visual cortex. This processing produces a set of wavelet coefficients associated with each point. Face similarity is then evaluated by comparing the corresponding wavelet coefficients. Thus, this method bases its similarity measurement on similarities between the corresponding nodes of the faces in different images.

Many other techniques for quantifying the similarity between face images have been used. Samaria et al. [3] describe a stochastic approach. (A good review of some of the approaches used prior to 1995 can be found in [5].) However, when the Army Research Laboratory sponsored a competition to find the most robust face recognition algorithms, the winners were the graph-matching algorithm and two PCA-based algorithms. Subsequently, Hancock et al. [4] compared the variations in similarity ratings produced by these methods to those judged by humans. They concluded that "Neither computer system does a particularly good job of explaining the human variance." Perhaps this is because humans tend to classify faces based on high-level criteria. As O'Toole et al. [6] observe, terms such as "perky" or "mean-looking" can be helpful to humans in classifying faces, although humans might be hard pressed to specify exactly what facial feature is used to perform this classification. In an apparent attempt to explore the use of words to describe faces, the University of Saarland has an online experiment [8] on face perception that can be accessed via the Internet. Each subject is allowed to review a single face and assign a set of words to it from a pre-selected list. The experiment also includes similarity ratings that allow a subject to specify degrees of similarity (1 to 5) for a pair of faces.

3. LIMITATIONS OF EXISTING TECHNIQUES

O'Toole et al. [6] view the basis functions that are used to reduce a face image to a set of coefficients as a "perceptual front end" and observe that the choice of the basis functions "has strong implications for the efficiency and accuracy with which different face processing tasks can be performed." In other words, if the set of basis functions fails to capture some important information about the face, any subsequent processing will inevitably be hampered by the lack of that data. The problem is that researchers generally don't know what information encoded in the high-dimensional face space is important (and what is not) for any given task. For this reason, they are understandably reluctant to deliberately use a set of basis functions that spans anything less than the entire space. Even when some information is deliberately discarded, the test of whether the "important" information has been retained is often answered by how accurately the original face image can be reconstructed from the resulting set of coefficients. This is despite the fact that the face processing performed in the human visual system is not done for the purpose of face reconstruction, but to extract the face content relevant to the current task.
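That reconstruction test is easy to state concretely. A minimal sketch, reusing the hypothetical eigenface conventions from the sketch above:

```python
import numpy as np

def reconstruction_error(image_row, mean_face, eigenfaces):
    """Reconstruct a face from its retained coefficients and measure the loss.

    The image is projected onto the retained eigenfaces and then mapped back
    to pixel space; the residual norm measures the information discarded.
    """
    coeffs = (image_row - mean_face) @ eigenfaces.T
    reconstruction = mean_face + coeffs @ eigenfaces
    return np.linalg.norm(image_row - reconstruction)
```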

4. THEORY

Chellappa et al. [5] state that there are three different kinds of problems that arise in face recognition: face matching (to verify identity), similarity measurement (to help witnesses search electronic mug shot databases), and face transformation (to model aging or to reconstruct a face from remains). Regarding the second kind of problem, Chellappa et al. observe, "Similarity detection requires... that the similarity measure used by the recognition system closely match the similarity measures used by humans." For example, it would be very frustrating to a witness to request "more images like this one" and get a set of images that are judged by the system to be similar, but appear to the user to be very dissimilar. Unfortunately, as discussed above, the two most successful algorithms for measuring similarity between face images do not correlate very well with subjective similarity, as evaluated by humans. Thus, there is a need for a new methodology for characterizing face images that allows similarity estimates that more closely parallel those of human perception. This is no easy task, in view of the fact that humans seem to evaluate facial similarity at many different levels simultaneously. The presence of local features (such as a beard or mustache) can be important, but so can the presence of widely dispersed features (such as freckles or acne scars). In addition, the emotional and mental state of the depicted person can greatly alter the appearance of the face. If a similarity metric is to be developed that parallels that of humans, a set of basis functions must be found that can capture facial similarity at all these different levels.

Perhaps one reason that the correlation with current methods is less than satisfactory is that humans often evaluate similarity at a rather high level. In fact, very high-level holistic information is sometimes used. Chellappa et al. [5] note that aesthetics are also important: beautiful faces are most quickly recognized, ugly faces are next, and "average" faces are least readily recognized. Distinctive faces are also recognized better and faster than "typical" faces. Unfortunately, humans evaluate the similarities between faces using largely subconscious processes. Because of this, introspection is of limited value in deducing the abstract basis functions used by the human visual system to index faces. Fortunately, some of the most distinctive (and/or most salient) facial characteristics are perceived at the conscious level, and are labeled with words. The fact that these distinctive characteristics have been given names seems to indicate that they are especially significant, and might be more perceptually salient. Thus, they might be useful for evaluating the perceived similarity of facial images. As O'Toole et al. [6] observe, humans seem to find abstract language (such as "long-faced" or "lively") useful for conveying information about faces, and these words seem to help us form abstract mental classifications of faces that assist us in distinguishing faces. One method for compiling a list of such words is to study a comprehensive lexicon. When used as part of a method for indexing face images, such words might be described as lexical basis functions.

5. AN OVERVIEW OF OUR APPROACH

In order to compare the similarity ratings of any method to those of human judgment, it is important to first "calibrate" the pairs of images within an image set based on the similarity perceived by human participants. This could be done by having human participants compare all possible pairs of images within a face database, assigning a similarity rating to each pair. As a practical matter, however, human evaluators would find it very tedious to compare and subjectively rate the similarity between thousands of image pairs. Given the rather long time required to complete such a task, the results might not be very consistent from beginning to end, especially if the task were performed by a single individual. A more practical method is to provide the participant with a "reference image" from a database of face images, and then ask him/her to scan an array of all of the images in the database, selecting those images that are "most similar" to the reference image. After the participant has selected the most similar images, he/she is then asked to "rank order" those images, based on their relative similarity to the reference image. Using this method we designed an experiment that used frontal face images from the Purdue AR Face database [7]. Each participant was presented with a "reference" image taken from this set of images, and was asked to (1) select the "most similar" faces from the remaining images, and (2) rank order those images based on their degree of similarity to the reference image. The results of this experiment provide a "ground truth" for evaluating the performance of a similarity measurement algorithm against similarity as perceived by the human subjects.

To compile a list of lexical basis functions for characterizing faces, we reviewed a comprehensive lexicon of the English language and selected words that could be used to represent distinctive facial characteristics, such as bearded, muttonchops, double-chinned, bushy-browed, bug-eyed, bald, brunette, curly-haired, and thin-lipped. In doing so, we found that these words could be classified into two general categories: those that characterize local facial features (such as headgear, hair, forehead, eyebrows, eyes, ears, nose, cheeks, philtrum, mouth, teeth, chin, beard, skin, and neckwear) and those that characterize holistic facial features (such as age, emotional state, mental state, physical appearance, and aggression). In our first experiment we chose to use the basis functions that characterized local facial features. Each of these lexical basis functions was then associated with a corresponding coefficient value that represented the degree of its presence in a particular face image. All of the coefficient values for a given image were then collectively taken to constitute a vector. The vectors representing two different face images could then be compared to determine the visual similarity of the faces in those images. For example, the dot product of two vectors could be used to provide a single scalar value representing the similarity (a brief sketch of this computation follows this section). The success of this approach depends upon whether the words chosen as lexical basis functions can adequately represent the multidimensional space of the face images in the set. More particularly, the question is whether those basis functions adequately represent the facial characteristics perceived by humans when judging similarity. To answer this question, we conceived an experiment that allowed us to compare the similarity of the faces, as represented by our set of basis functions, to the similarity that is subjectively perceived by human observers.
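A minimal sketch of that dot product comparison, using a hypothetical five-word lexical basis (the actual word list used in the experiment contains 192 words):

```python
import numpy as np

# Hypothetical lexical basis for illustration only
LEXICON = ["bearded", "eyeglasses", "smiling", "curly-haired", "freckled"]

def lexical_vector(checked_words):
    """Binary feature vector: 1 if the word was checked for this face."""
    return np.array([1 if w in checked_words else 0 for w in LEXICON])

face_a = lexical_vector({"bearded", "eyeglasses", "smiling"})
face_b = lexical_vector({"bearded", "smiling", "freckled"})
similarity = int(face_a @ face_b)  # dot product: number of shared features
print(similarity)                  # -> 2 ("bearded" and "smiling")
```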

6. THE EXPERIMENTAL PROCEDURE

The experiment consisted of two independent procedures that were performed by different participants. Both procedures used the same image set, which consists of 75 frontal face images.

Procedure 1 was performed by 17 different participants. Each participant was given one of the images from the 75-image set, and was then asked to (1) subjectively compare the visual content of that reference image to all 75 images in the set, (2) select the 20 most similar images (which included the reference image itself), and (3) rank order those 20 images based on their relative similarity to the reference image. (Most of the participants repeated this procedure for 5 different reference images.) Procedure 1 was iterated 75 times; in each iteration a different image from the image set was used as the reference image. Thus, the participants chose 20 "similar" images for each of the 75 images in the image set.

Procedure 2 was performed by a single participant, and employed a list of 192 words that had previously been selected by the investigators from an English lexicon containing over 230,000 words. These words were chosen based on their perceived usefulness in describing local facial features. The participant was given each of the 75 images from the image set, paired with an associated "check sheet" that contained these 192 descriptive words. The participant was then asked to indicate (with a check mark) which (if any) of these words were useful for describing each particular image in the image set. Thus, at the conclusion of Procedure 2 the investigators had a list of descriptive words for each of the 75 images in the image set.

7. RESULTS

Of the 192 words originally chosen for content evaluation, only 58 were checked on the check sheets for more than one image (see Table 1). These 58 words were then used to define a 58-element binary vector that represented the content of each of the 75 images.

Table 1. The 58 content words used to compute dot product similarities:

acne, adolescent, adult, bareheaded, bearded, beardless, big eared, bleary-eyed, blond, blue-eyed, brown-eyed, brunette, bucktoothed, bushy-browed, Caucasian, clean-shaven, close-mouthed, curly-haired, dark-skinned, earringed, eyeglasses, eyes front, fair-haired, freckle-faced, freckled, freckly, full-mouthed, girl, hairy-faced, hard-mouthed, light-bearded, long-haired, maiden, man, moustache, necklace, open-eyed, open-mouthed, pug nose, ruddy, shaved, smiling, smooth-shaven, snub-nosed, spectacled, stubble, sun tanned, thick-lipped, thin-lipped, tight-lipped, top-heavy, tousled, uncollared, unshaved, wide-eyed, woman, young, youthful

Dot products of these binary vectors were then used to compute the "dot product similarity" of each pair of images. Fig. 1 shows a histogram of the dot product similarity values obtained by applying the word list in Table 1 to the image set.

[Fig. 1: A histogram of the dot product similarities generated using the word list in Table 1]
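A sketch of how such a similarity matrix and histogram could be computed. Here `annotations` is a stand-in for the experiment's 75 x 58 matrix of check marks (random data is used only so the sketch runs as written):

```python
import numpy as np

# Stand-in for the real annotations: one row of word check marks per image
annotations = np.random.default_rng(0).integers(0, 2, size=(75, 58))

# All pairwise dot products at once: entry (i, j) counts the words
# checked for both image i and image j.
dot_similarity = annotations @ annotations.T  # 75 x 75, symmetric

# Histogram of the off-diagonal similarity values (cf. Fig. 1)
pairs = dot_similarity[np.triu_indices(75, k=1)]
counts, bin_edges = np.histogram(pairs, bins=10)
```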

The dot product similarity values for all 75 x 75 = 5625 image pairs were then arranged to form a 75 x 75 dot product similarity matrix. (Note that this matrix is symmetrical about its diagonal.)

Unlike the dot product (which produces an absolute similarity value), the subjective similarity evaluation done in Procedure 1 (which produces a rank ordering of the 20 most similar images) provides only relative similarity values. Since each reference image was taken from the 75-image set, the most similar image in the set was identical to the reference image, and was assigned a similarity of 75. The other 19 images selected by the human evaluator were then assigned descending values of 74, 73, etc., with the least similar image being assigned a value of 56. The 55 images not chosen by the evaluator were each assigned a similarity value of zero. The resulting set of 75 values (most of which were zeros) was then used to form a single row in another 75 x 75 matrix: the subjective similarity matrix.

This subjective similarity matrix is sparse. The undefined values within this sparse matrix fall within a range from 1 to 55; however, we cannot be sure of the exact value of each element. (Note that we have arbitrarily assigned a value of zero to all of these undefined matrix elements.) This uncertainty makes the computation of a correlation coefficient between the dot product similarity matrix and the subjective similarity matrix problematic. One method for studying the correlation between the elements of these two matrices is to (1) treat the arbitrarily assigned values of zero as actual values, (2) pair each element of the 75 x 75 dot product similarity matrix with its corresponding element in the subjective similarity matrix to create 75 x 75 = 5625 pairs, (3) sort each of these pairs into a set of "bins," based on that pair's dot product similarity value, and (4) average the subjective similarity values in each bin and plot these averages against the dot product similarity of that bin. Fig. 2 shows the results.

[Fig. 2: The correlation between dot product similarity and subjective similarity]
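A minimal sketch of both steps (converting one evaluator's ranking into a row of the subjective similarity matrix, and bin-averaging the paired values); the function names and bin count are ours:

```python
import numpy as np

def subjective_row(ranked_indices, n_images=75, n_chosen=20):
    """Convert one ranking into a row of the subjective similarity matrix.

    ranked_indices: indices of the 20 chosen images, most similar first
    (the reference image itself is ranked first). Chosen images receive
    descending scores 75, 74, ..., 56; unchosen images receive 0.
    """
    row = np.zeros(n_images)
    for rank, idx in enumerate(ranked_indices[:n_chosen]):
        row[idx] = n_images - rank
    return row

def bin_averaged_correlation(dot_sim, subj_sim, n_bins=10):
    """Average subjective similarity within bins of dot product similarity."""
    dots, subj = dot_sim.ravel(), subj_sim.ravel()
    bins = np.linspace(dots.min(), dots.max(), n_bins + 1)
    which = np.clip(np.digitize(dots, bins) - 1, 0, n_bins - 1)
    return np.array([subj[which == b].mean() if np.any(which == b) else np.nan
                     for b in range(n_bins)])
```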

8. DISCUSSION

Fig. 2 shows a good correlation between the dot product similarity and the subjective similarity for the image set, except for a deviation in bins 2 and 4, where the average subjective similarity is higher for a lower dot product similarity. The histogram in Fig. 1 shows that the number of samples in each of these two bins is quite small, and thus their respective averages are subject to substantial sampling error. Bin 2 contains only 6 pairs (out of a total of 5625 pairs), 2 of which have non-zero subjective similarity values. Bin 4 contains only 52 samples, 11 of which have non-zero subjective similarity values. Interestingly, 6 of the 13 non-zero (seemingly anomalous) subjective similarity values in these two bins originated with a single experimental participant.

9. CONCLUSION

This paper has described research conducted to test the hypothesis that the facial features represented by a set of lexical basis functions would be useful for evaluating the similarity of face images in a manner that parallels the subjective judgment of human evaluators. This research has (1) defined a set of face features as represented by a list of lexical basis functions, (2) annotated a set of images based on this list of basis functions, (3) determined an ordered set of the "most similar" images for each of the images within a set, based on subjective evaluations by human participants, and (4) measured the usefulness of the lexical basis functions in gauging the similarity of the images within the set. The overall correlation in Fig. 2 indicates that the dot product similarity measures derived from this set of lexical basis functions provide a useful measure of the similarities within this image set, not just for choosing the most similar images, but also for ordering those images based on their similarity to the reference image. If a face similarity algorithm could be designed to detect the presence of the local facial features represented by these lexical basis functions, it would have a similarity measure for this image set that would parallel the judgment of the human participants in our experiment.

10. BIBLIOGRAPHY

[1] M. Turk, A. Pentland, "Eigenfaces for recognition", Journal of Cognitive Neuroscience, 3(1), 1991, pp. 71-86.
[2] L. Wiskott, J. Fellous, N. Kruger, C. von der Malsburg, "Face recognition by elastic bunch graph matching", IEEE Trans. on PAMI, 19(7), 1997, pp. 775-779.
[3] F.S. Samaria, A.C. Harter, "Parameterisation of a stochastic model for human face identification", Proceedings of the Second IEEE Workshop on Applications of Computer Vision, 1994, pp. 138-142.
[4] P. Hancock, V. Bruce, M. Burton, "A comparison of two computer-based face identification systems with human perceptions of faces", Vision Research, 38(15-16), 1998, pp. 2277-2288.
[5] R. Chellappa, C. Wilson, S. Sirohey, "Human and machine recognition of faces: a survey", Proceedings of the IEEE, 83(5), 1995, pp. 705-740.
[6] A.J. O'Toole, H. Abdi, K.A. Deffenbacher, D. Valentin, "A perceptual learning theory of the information in faces", in Cognitive and Computational Aspects of Face Recognition, London: Routledge, pp. 159-182.
[7] A.M. Martinez, R. Benavente, "The AR Face Database", CVC Technical Report #24, June 1998.
[8] www.uni-saarland.de/fak5/ronald/Experim/Similar/SimInstr_e.htm
