Sampling Design For Face Recognition
Yanjun Yan and Lisa A. Osadciw
EECS, Syracuse University, Syracuse, NY, USA {yayan, laosadi}@syr.edu

ABSTRACT
A face recognition system consists of two integrated parts: one is the face recognition algorithm; the other is the classifier and features derived by the algorithm from a data set. The face recognition algorithm certainly plays a central role, but this paper does not aim at evaluating the algorithm. Instead, it derives the best features for the algorithm from a specific database through sampling design of the training set, which directs how the sample should be collected and dictates the sample space. Sampling design can help exert the full potential of the face recognition algorithm without overhaul. Conventional statistical analyses usually assume some distribution to draw inference, but design-based inference assumes neither a distribution of the data nor independence among the sample observations. The simulations illustrate that the systematic sampling scheme performs better than the simple random sampling scheme, and that systematic sampling is comparable to using all available training images in recognition performance, while the sampling schemes save system resources and alleviate the overfitting problem. However, post-stratification by sex is not shown to be significant in improving the recognition performance.
Keywords: Face recognition, sampling design, simple random sampling, systematic sampling, appearance based methods, DF-LDA, ROC “mist”
1. INTRODUCTION
A face recognition system consists of a face recognition algorithm and the classifier and features selected by the algorithm from a data set. The face recognition algorithm is a general framework guiding the selection of features and sometimes the construction of the classifier from a specific database. The features and classifier found by the algorithm provide a test statistic to decide whether a new face is from a known person or not. The test statistic is pragmatic, based on the database, and it can be regarded as a hypothetical random variable with a specific testing power β and level of significance α. The face recognition algorithm plays a central role since it guides the finding of features and the construction of the classifier. Many studies have evaluated a particular algorithm given a data set, such as papers [1–8]. If the data are abundant enough, the data set can be divided into a training set and a validation set. The training set is used by the algorithm to find the features and classifier. The validation set is used to evaluate whether the features and the classifier work well on new data. Some tuning may also be implemented to refine the features or classifier during the validation process. If the data are not abundant, cross validation is widely utilized: the data set is divided into several chunks; in each run, only one chunk is left out for validation; this process is repeated until all chunks are used up, and the average performance serves as a metric to evaluate the algorithm on this specific data set. A face recognition algorithm with better average performance cannot always guarantee that the particular features it finds are better. It is common that a face recognition system with the same classifier design, but with specific features derived from a certain training set, yields different identification rates for different face databases or implementation situations. This is usually regarded as the generalizability problem. The generalizability can be evaluated by the identification rate on the validation set. The evaluation of a face recognition algorithm is certainly important, because the algorithm is a general and abstract framework that can be transferred anywhere; but with the development of face recognition techniques, many comparisons have been made on this topic, and several algorithms such as DF-LDA have been shown to be robust and fast. Our aim is to make the abstract algorithm work best for a particular population or situation in practice by selecting the training set in the best way.
Cross validation aims at evaluating an algorithm, not at evaluating the features selected by that algorithm. As a by-product, the performance is evaluated for each combination of training set and validation set, from which the average performance can be calculated and the performance of each training set can be ordered. However, this ordering has not received further attention, and this paper tries to help the face recognition algorithm find the best way to select the training set and the final features. Therefore, given a face recognition algorithm, instead of evaluating the algorithm, is there a way to find the optimal training set, and then the best feature set, to attain the best performance of the algorithm? The performance is defined as a weighted sum of the identification rates on the training set and on the validation set, which trades off the error reduction in training against the generalizability. Without modifying the algorithm, the control over the feature selection is the construction of the training set for the registered users. Papers usually state how many images are used in deriving the features from the available images of registered users and how many images are used in testing or validation, as in papers [9, 10], but no paper has discussed how the training set should be constructed. The aim of this paper is to design a sampling scheme to construct a training set that hopefully contains the maximum information the algorithm can obtain from the training process to improve the selected features. By sampling design, the face recognition system can exert the full potential of the face recognition algorithm for a database or, more generally, for a population. In a descriptive but not exact analogy: in cross validation, the chunks can be divided by different methods; for each division method, the identification rate of leaving each chunk out can be evaluated and sorted, so the average and the variation of the performance for that division method can be obtained. Similarly, in this paper, the goal of sampling design for face recognition is to select, from all available face images of registered users, a method to construct a training set that yields better results most of the time, optimizing the performance of the face recognition algorithm. This analogy is not exact, and the differences between cross validation and sampling design are listed below.
1. The training set selection is not completely arbitrary in practice, as the chunk division is in cross validation. The registered users are usually a fixed group of people, and the face recognition system should retain the common features of this particular group of users. Thus the training set is taken from this special group and is not arbitrary.
2. The cross validation process usually uses many more training images than validation images. But in practice, the registered users usually enroll in the system with limited images (the training set) and use the system many times later (the validation set). Therefore the generalizability should be emphasized.
3. Having explained the goal of the sampling design, it should be emphasized that sampling design is a statistical technique that tries to find a better feature set. There is still the possibility that a less effective feature set will be constructed by this sampling method, but the probability of constructing a better feature set is enhanced.
The advantage of sampling design is that it exerts the full potential of any given face recognition system without overhaul, since the sampling design is implemented at the training set construction stage.
2. CLASSIC SAMPLING DESIGNS
Conventional statistical analyses usually assume some distribution of the sample as if the assumption naturally held; statistical inference can be drawn only under such a hypothesized model. In addition, given a sample, there is no way to tell how the sample was collected. The missing step is the sampling design, which directs how the sample should be collected and dictates the sample space [11]. Design-based inference assumes neither a distribution of the data nor independence among the sample observations. Sampling design treats every observed datum as a fixed quantity without measurement error, because the major ‘uncertainty’ in sampling design comes from which elements of the population are picked into the sample, which overwhelms the measurement error, if any. Sampling design is different from conventional statistics, so some definitions and design schemes are introduced below.
2.1. Some Definitions
1. Population: the set of attributes that we are interested in. In sampling design, the population has to be clearly defined. In a face recognition system, the face images of all potential users constitute the population, which is practically infinite. In simulation, the population is the set of all face images in the database, which is finite. For one person, there are usually multiple images with different pose, expression or illumination, and these images constitute a cluster for that person.
2. Frame: a representation of the population that allows us to gain access to its elements so that we can collect a sample. For example, the name list of all registered users is a frame for collecting the users’ face images.
3. Sample: a subset of the population, with size n. All the face images used for training constitute a sample from which the face recognition algorithm learns what the population looks like. Given all the available face images from the registered users, the selected images used for training constitute a sample. This can be regarded as second-stage sampling if the database collection is the first-stage sampling.
2.2. Simple Random Sampling
Simple random sampling randomly collects n elements out of the N elements of the population, which is compatible with most conventional statistical analyses and mathematically tractable. In a simple random sampling design, there are $\binom{N}{n}$ possible samples in the sample space. The advantages of simple random sampling include:
1. It is easy to implement if a list frame is available. A random number generator can be utilized to realize the roulette-wheel selection of n elements without replacement.
2. It approximately satisfies the sampling model on which conventional statistical analyses are based, such as the t-test, Chi-square test and regression.
3. The analysis for simple random sampling is simple, and it is easy to correct for errors or other problems in selecting the sample, so we can be more confident that the result is correct in this context. On the contrary, some complex techniques may be plausible but harder to trace for errors.
4. It is easy to combine simple random sampling with more sophisticated classifiers.
On the other hand, there are also several disadvantages of simple random sampling:
1. It may be inefficient (with higher variance) relative to other, more complex sampling designs.
2. It will usually result in small sample sizes for rare classes because it samples classes in proportion to their representation in the population.
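As an illustration, a minimal sketch of drawing a simple random sample from a list frame (the function name and the `seed` parameter are hypothetical; any uniform random number generator serves the purpose):

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n elements from the list frame without replacement.

    `frame` is a list frame, e.g. the file names of the enrolled
    face images; `n` is the desired training-set size.
    """
    rng = random.Random(seed)
    # roulette-wheel selection of n elements without replacement
    return rng.sample(frame, n)
```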
2.3. Stratified Sampling
Stratified sampling first partitions the population into a set of H strata. Strata must be defined prior to selecting the sample, so that every element of the population can be identified with a single stratum without ambiguity or overlap. Each stratum is sampled independently of all other strata. The stratification principle states that the elements within a stratum should be as homogeneous as possible, and the strata should be as heterogeneous as possible. By dividing the population in this way, the total variance of the estimation can be lowered. For example, the face database can first be partitioned into male and female strata, so that each face belongs to exactly one stratum. Each stratum is sampled independently and the final result is a combination of the two strata. Alternatively, the strata could be partitioned by expression, such as neutral, smiling, angry or sad, but the boundary between different expressions has to be clearly defined so that every face image is allocated to exactly one stratum without ambiguity. The advantages of stratified sampling include:
1. Because the sample size in each stratum can be controlled, the variance of the estimates computed within strata can be controlled. Thus an adequate sample for a specified level of precision can be obtained even if the stratum consists of rare elements.
2. It provides administrative convenience for implementing the sample. For example, if one database is collected mainly from American people and another mainly from Asian people, the two databases are naturally stratified and can be united in scaling up the face recognition system.
On the other hand, there are also several disadvantages of stratified sampling:
1. Frame development may be expensive, since every element in the universe must be identified with a stratum prior to selecting the sample.
2. Gains in precision are due to the stratification principle. If the elements within the strata are not homogeneous, the gains due to stratification may not be large.
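A minimal sketch of stratified sampling, under the assumption of a fixed per-stratum allocation (the helper names `stratum_of` and `n_per_stratum` are hypothetical):

```python
import random
from collections import defaultdict

def stratified_sample(frame, stratum_of, n_per_stratum, seed=None):
    """Sample each stratum independently of all others.

    `stratum_of` maps an element to its stratum label (e.g. 'female' /
    'male'); `n_per_stratum` maps a stratum label to its sample size.
    Every element must fall into exactly one stratum.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for element in frame:
        strata[stratum_of(element)].append(element)
    sample = []
    for label, members in strata.items():
        # independent simple random sample within each stratum
        sample.extend(rng.sample(members, n_per_stratum[label]))
    return sample
```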
2.4. Cluster Sampling
Cluster sampling also groups the elements of the population, but there is no need to identify each element in the universe as in stratified sampling. Cluster sampling first partitions the universe into K clusters and treats each cluster as a random variable at a higher level. The cluster sampling principle states that the elements within a cluster should be as heterogeneous as possible, and the clusters as a whole should be as homogeneous as possible. The K clusters are the primary sampling units (psu’s). Out of the K clusters, k clusters are sampled. Elements within a cluster are called secondary sampling units (ssu’s). The idea behind cluster sampling is to treat a group of elements as a new random variable, and the ssu’s can be further divided, giving multi-stage cluster sampling. Cluster sampling differs from stratified sampling in the following aspects:
1. Stratified sampling takes a subset of elements in every stratum, and all strata have to be sampled. Cluster sampling takes all units in the selected clusters, but only a subset of clusters has to be sampled.
2. Stratified sampling groups similar items into strata, and the elements are usually not physically close. Cluster sampling groups dissimilar items into clusters, but the elements are usually physically close.
The advantages of cluster sampling include:
1. Cluster sampling is cost efficient, since the elements in a cluster are usually physically close. For instance, the face images of a single person constitute a cluster, and these images are usually enrolled into the system at the same time; the sampling is then concentrated over that duration, and the sampling cost is low.
2. If within-cluster (psu) heterogeneity exists and the psu’s have equal numbers of ssu’s, cluster sampling will be very precise.
On the other hand, there are also several disadvantages of cluster sampling:
1. The cluster sampling principle requires that the elements within a cluster be as heterogeneous as possible, but usually the ssu’s within a cluster are similar, so cluster sampling may not be efficient.
2. Cluster sampling does not satisfy the assumptions of standard statistical analyses such as hypothesis tests, because observations within a cluster are usually not independent.
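A sketch of one-stage cluster sampling under the paper's natural clustering (one cluster per person); the names are illustrative:

```python
import random

def cluster_sample(clusters, k, seed=None):
    """One-stage cluster sampling: select k of the K clusters (psu's)
    and take every element (ssu) inside each selected cluster.

    `clusters` maps a cluster id (e.g., a registered person) to the
    list of that person's face images.
    """
    rng = random.Random(seed)
    chosen = rng.sample(list(clusters), k)  # sample k psu's out of K
    return [image for person in chosen for image in clusters[person]]
```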
2.5. Systematic Sampling
Systematic sampling first selects a sampling interval S and a random start s from the numbers 1 to S, where each number has equal probability 1/S of being selected. It then takes the s-th element of the universe as the first observation, and every S-th element after that, until the end of the population; the size of the sample is then n ≈ N/S. If the list frame is random enough, systematic sampling can ensure the separation of the observations in the sample. There are S samples in the systematic sampling sample space, which is usually smaller than the simple random sampling sample space. If each possible sample is regarded as a cluster, then since the elements of each sample are always sampled (or not) together, systematic sampling can be regarded as cluster sampling with K = S and k = 1. Thus systematic sampling creates a partition of the population much like that of cluster sampling. The advantages of systematic sampling include:
1. It is practical and convenient to implement.
2. It often provides better precision than simple random sampling.
3. It provides a better spatial distribution of the sample than simple random sampling.
On the other hand, there are also several disadvantages of systematic sampling:
1. There is no generally satisfactory variance estimator, so the variance must be approximated.
2. There is the potential for poor precision if the population has periodicity and the sampling interval is in phase with this periodicity.
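A minimal sketch of 1-in-S systematic sampling over a list frame (the function name is illustrative):

```python
import random

def systematic_sample(frame, S, seed=None):
    """1-in-S systematic sampling: pick a random start s in 1..S,
    then take the s-th element of the list frame and every S-th
    element after it, giving n ~= N / S observations."""
    rng = random.Random(seed)
    s = rng.randrange(1, S + 1)   # each start has probability 1/S
    return frame[s - 1::S]        # s-th element, then every S-th after it
```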
3. SAMPLING DESIGN IN FACE RECOGNITION SYSTEM
For a face recognition system, the population is the union of face images from the registered users and the unregistered imposters. The frame is usually a name list of all the images. Once the set of registered users and the set of training images for each user are determined, the face recognition system may process all the available images altogether. But with the continuous growth of face databases, the number of available images used to train a face recognition system becomes huge, and a sampling scheme can be utilized to shrink the training set at the outset to find the most representative images of the registered users.
3.1. Combination of several sampling schemes
Following the discussion in section 2, the population of face images can be sampled at the following levels:
1. The face images can be stratified by standards such as database collection, female/male, age, or expression. The stratification principle states that the elements within a stratum should be homogeneous, and the strata should be heterogeneous. But a stratification by one standard may not be good for another standard, so post-stratification analysis may be preferable to see how the stratification affects the classification results. Namely, after classification without stratification, the samples can be grouped by one standard at a time; for each standard, the variation within and between strata can be obtained by grouping the overall results, to see which standard is more suitable for stratification or whether stratification is needed at all.
2. Within a stratum, the face images of the same person naturally constitute clusters. The cluster principle states that the elements within a cluster should be heterogeneous, and the clusters should be homogeneous. The homogeneity of the clusters is consistent with the homogeneity requirement on the elements in stratification.
3. Within a cluster, the set of all possible face images of the person represents the variation of this person. This heterogeneity is consistent with the heterogeneity requirement on the elements in clustering. Within the cluster, if the available images are taken from a series of video frames, the correlation between the images is usually high, so not all images are needed for training the face recognition system. Simple random sampling or systematic sampling can be used to select the cluster elements.
The sampling levels are shown in Figure 1, which is a tree-like structure with face images as leaves. The highest level is stratification. The middle level is clustering for each user. At the lowest level, the face images can be sampled by simple random sampling or systematic sampling, if needed.
Figure 1. Sampling scheme for face recognition system. The highest level is stratification. The middle level is clustering. The lowest level can be sampled by simple random sampling or systematic sampling.
When the enrolling images are highly correlated, as in the video sequence case, random sampling as shown in Figure 1 can be implemented to shrink the training set instead of using all enrolling face images for training, which helps not only in saving computation resources and time, but also in alleviating overfitting.
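To make the three-level scheme of Figure 1 concrete, the following is a minimal sketch of training set construction under stated assumptions: images are identified by file names, and `stratum_of` and `person_of` are hypothetical helpers that map a file name to its stratum label and subject id.

```python
import random

def build_training_set(images, stratum_of, person_of, S, seed=None):
    """Three-level training-set construction: stratify the image
    population, cluster each stratum by person, then subsample each
    person's sorted images by 1-in-S systematic sampling."""
    rng = random.Random(seed)
    # levels 1 and 2: group image file names by (stratum, person)
    clusters = {}
    for img in images:
        clusters.setdefault((stratum_of(img), person_of(img)), []).append(img)
    # level 3: systematic sampling inside each person's cluster
    training = []
    for key in sorted(clusters):
        imgs = sorted(clusters[key])        # name order, e.g. capture order
        s = rng.randrange(1, S + 1)         # random start in 1..S
        training.extend(imgs[s - 1::S])
    return training
```

Sorting each cluster before subsampling mirrors the name-ordered directory structure discussed in section 5.2, which is what lets systematic sampling spread the selected images evenly over each person's captures.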
3.2. Comparison of using all or a subset of images for training
When the face images per person are not abundant, resampling techniques such as bootstrapping can be utilized to construct more samples for training [12]. But when the face images per person are more than enough, or the number of users far exceeds the feature dimension, sampling techniques can be utilized to shrink the training set, as discussed in this paper. Another application of sampling techniques is illustrated in paper [13], which used random sampling to select the fisherface and null space LDA features to overcome the overfitting problem. However, this paper focuses on sampling design for constructing the training set at the image level, not the feature level. In the feature extraction phase, not all clusters (each cluster corresponds to one person) may be used; therefore, after the features are extracted, all clusters should be projected onto the feature space to construct a template for each cluster (corresponding to each registered user). When all the available training images from the database are used, the ROC curve (Receiver Operating Characteristic curve) is a fixed line, since the features and the distributions of the scores are fixed once the algorithm is fixed. Each point on the ROC curve represents an operating condition with a specific threshold, everything else in the decision rule being the same. However, when the available training set is not used completely, but only a subset is used as representative images, the ROC curve is no longer a “curve” but a set of scattered points like a “mist”, where each point in the “mist” represents an operating condition with a specific sampling of the training set, the threshold being determined by a fixed procedure without variation. Some figures of the ROC “mist” are illustrated in section 5. From a practical point of view, compared to the set of all possible captures of face images, the set of available images is never complete, so a practical ROC curve is nothing but a random realization of the underlying ROC curve determined by the “true” distribution of the similarity scores for registered users and imposters. This discussion of the ROC “mist” should hopefully bring some attention to the effect of randomness in constructing the training set.
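For clarity, a small sketch of how one point of the “mist” can be computed, assuming similarity scores where larger means more similar, and percentages to match the figures; the function and argument names are illustrative. Repeating this over many random training subsets, each of which yields new features and hence new score sets, scatters the points into the “mist”.

```python
import numpy as np

def roc_point(genuine_scores, impostor_scores, threshold):
    """One (FRR, FAR) operating point for a fixed decision threshold."""
    genuine = np.asarray(genuine_scores)
    impostor = np.asarray(impostor_scores)
    frr = np.mean(genuine < threshold) * 100    # falsely rejected users, %
    far = np.mean(impostor >= threshold) * 100  # falsely accepted impostors, %
    return frr, far
```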
4. BRIEF ON FACE RECOGNITION ALGORITHM
Though the algorithm is not the focus of this paper, the DF-LDA by Euclidean distance method is chosen for its better performance than the D-LDA [14–16], Fisherface [17] and Eigenface [9, 18] methods. These appearance based methods may be unified into the LaplacianFace method [19]. The DF-LDA by Euclidean distance method is based on the DF-LDA method proposed in [20], with some changes in the definition of distance, and it is briefly introduced below. The optimization function used in DF-LDA is

\Phi = \arg\max_{\Phi} \frac{\Phi^T \hat{S}_b \Phi}{\Phi^T \hat{S}_b \Phi + \Phi^T S_w \Phi}    (1)

where \hat{S}_b is a weighted version of the between-class scatter, S_w is the within-class scatter, and \hat{S}_t = \hat{S}_b + S_w is a modified total scatter. The DF-LDA method first finds \hat{S}_{b,\mathrm{range}}, projects \hat{S}_t onto it, and finds the features that minimize \hat{S}_t, instead of finding the null space of \hat{S}_t, since it is shown that the projected \hat{S}_t is nonsingular. Paper [16] used the modified version of \hat{S}_b shown below, rather than the common definition of the between-class scatter, and the modified version makes it efficient to calculate \hat{S}_{b,\mathrm{range}} by SVD:

\hat{S}_b = \sum_{k=1}^{c} \left[ \sqrt{\frac{n_k}{N}} \sum_{l=1, l \neq k}^{c} w(d_{kl}) (\mu_k - \mu_l) \right] \left[ \sqrt{\frac{n_k}{N}} \sum_{l=1, l \neq k}^{c} w(d_{kl}) (\mu_k - \mu_l) \right]^T \triangleq \sum_{k=1}^{c} \omega_k \omega_k^T = (\omega_1, \omega_2, \ldots, \omega_c)(\omega_1, \omega_2, \ldots, \omega_c)^T \triangleq \Omega \Omega^T    (2)

where the quantities defined in the intermediate steps are self-explanatory. \hat{S}_{b,\mathrm{range}} can be constructed from the SVD of \Omega directly:

\Omega = Q \cdot \Lambda_b \cdot P^T    (3)

Sort the diagonal matrix \Lambda_b so that its diagonal values run from high to low. Then

\hat{S}_{b,\mathrm{range}} = \mathrm{span}(Q(:, 1:c-1))    (4)

Let

\Gamma_1 = Q(:, 1:c-1) \, \Lambda_b(1:c-1, 1:c-1)    (5)

be the projection onto \hat{S}_{b,\mathrm{range}}, and project \hat{S}_t onto \hat{S}_{b,\mathrm{range}}:

\hat{S}_{tt} = \Gamma_1^T \hat{S}_t \Gamma_1    (6)

where \hat{S}_{tt} is a much smaller matrix than \hat{S}_t, and paper [16] proves that \hat{S}_{tt} is nonsingular, so the eigen decomposition of \hat{S}_{tt} is unique and mathematically easy:

\hat{S}_{tt} \cdot E = E \cdot \Lambda_{tt}    (7)

Sort the diagonal matrix \Lambda_{tt} so that its diagonal values run from low to high. Suppose that the eigenvectors associated with the smallest m eigenvalues are retained; then the features that minimize \Phi^T \hat{S}_{tt} \Phi are

\hat{S}_{tt,\mathrm{small}} = \mathrm{span}(E(:, 1:m))    (8)

Let

\Gamma_2 = E(:, 1:m) \, \Lambda_{tt}(1:m, 1:m)    (9)

be the projection matrix onto the partial space in \hat{S}_{tt} that minimizes \Phi^T \hat{S}_{tt} \Phi. Then the projection matrix that maps the raw image vectors onto the feature space is

\Phi_{DF\text{-}LDA} = \Gamma_1 \cdot \Gamma_2    (10)

In the above method, the diagonal eigenvalue matrices are all multiplied back into the eigenvectors when constructing the projection matrix, which corresponds to using the Mahalanobis distance. However, the classification performance was not satisfactory on the ORL and UMIST databases. After checking the features derived by the DF-LDA method, the problem was found to be that the major features are essentially nulled out by the extremely small eigenvalues, which is why the DF-LDA method did not provide satisfactory classification results. Based on this analysis, the distance is changed to the Euclidean distance without the multiplication by the eigenvalues. Namely, instead of (5) and (9), the following formulas are used:

\Gamma_1 = Q(:, 1:c-1)    (11)

\Gamma_2 = E(:, 1:m)    (12)

The feature space is constructed by the above projection, and the distances between the testing images and the templates are evaluated in the feature space to determine the similarity score for decision.
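To make the procedure concrete, here is a minimal numerical sketch of the Euclidean-distance variant, Eqs. (2)–(12), in Python with NumPy. It is an illustration under stated assumptions, not the authors' implementation: the weighting function w(d) is a placeholder (a decreasing power of the class-mean distance; the exponent is an assumption), and the data are assumed small enough to hold in memory.

```python
import numpy as np

def df_lda_features(X, labels, m, weight=lambda d: d ** -8):
    """Sketch of DF-LDA feature extraction with the Euclidean-distance
    variant of Eqs. (11)-(12): eigenvalue matrices are NOT multiplied
    back into the eigenvectors. X is (N, D), one vectorized face image
    per row; `weight` is an assumed decreasing weighting function w(d)."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    classes = np.unique(labels)
    c, N = len(classes), len(X)
    mu = np.array([X[labels == k].mean(axis=0) for k in classes])  # class means

    # Eq. (2): the columns of Omega are the weighted mean-difference sums omega_k
    omega = []
    for k in range(c):
        n_k = np.sum(labels == classes[k])
        d = sum(weight(np.linalg.norm(mu[k] - mu[l])) * (mu[k] - mu[l])
                for l in range(c) if l != k)
        omega.append(np.sqrt(n_k / N) * d)
    Omega = np.column_stack(omega)                                  # (D, c)

    # Eqs. (3), (4), (11): range of S_b via SVD of Omega, keep c-1 directions
    Q, _, _ = np.linalg.svd(Omega, full_matrices=False)             # singular values descending
    Gamma1 = Q[:, :c - 1]

    # Eq. (6): project S_t = S_b + S_w onto range(S_b) without forming DxD matrices
    A = Gamma1.T @ Omega                                            # projected S_b factor
    Xc = X - mu[np.searchsorted(classes, labels)]                   # class-centered data
    B = Xc @ Gamma1                                                 # projected S_w factor
    Stt = A @ A.T + B.T @ B                                         # (c-1, c-1), nonsingular per [16]

    # Eqs. (7), (8), (12): keep the eigenvectors of the m smallest eigenvalues
    _, E = np.linalg.eigh(Stt)                                      # eigenvalues ascending
    Gamma2 = E[:, :m]
    return Gamma1 @ Gamma2                                          # Eq. (10): (D, m) projection
```

Note that the projected scatter \hat{S}_{tt} is only (c-1) × (c-1), which is what makes the eigen decomposition tractable even when the raw image dimension D is huge.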
5. SIMULATION RESULTS
The first task in the simulation is to check whether the random sampling scheme can save system resources and alleviate the overfitting problem. The second task is to compare simple random sampling and systematic sampling in constructing the clusters. The third task is to try the post-stratification technique to see whether the weighting by stratification helps the overall recognition rate. Once the sampling design is determined, the sample space can be constructed accordingly, and the properties of the samples in the sample space can be analyzed. To keep the sample space manageable, the relatively small ORL [21] and UMIST [22] face databases are used in the simulation, with 400 and 1012 available images respectively; after the division into registered user and imposter sets, the training and testing sets for the registered users are even smaller, so the subset of the training set is manageable.
5.1. Comparison of complete set and random subset
A complete training set contains the maximal information that a face recognition system can utilize, but the problems of using a complete set include:
• When the images are big and the dataset is huge, the vectorized training images constitute a huge matrix, and the direct manipulation of such a huge matrix may be impractical, since current computers can usually handle a matrix of up to 2^32 ≈ 4G entries due to the limitation of address assignment, even with virtual memory. Stepwise algorithms such as principal component neural networks [23] may be utilized to implement PCA or other similar eigen decompositions by feeding the images sequentially, but the expense and complexity of the calculation may not be justified by the asymptotic performance of such stepwise algorithms. Without such stepwise algorithms, however, the eigen decomposition of a huge matrix is simply impossible.
• Meanwhile, when the available images are abundant, overfitting can occur. PCA is the optimal method to pack the information into as few features as possible, but the base features, or principal components, are not fixed: they depend on the training set, so the feature derivation depends heavily on the training set. If the training set contains outliers, the derived features may overfit the training set and will not work well on new testing images or imposter images.
Therefore, a subset of all available training images may be utilized as representative images for training. After the features are determined, the whole training set is projected onto this feature space and a template for each registered user can be constructed for comparison in testing later on. The random selection of the subset is by sampling design, as discussed in this paper. The comparison of using all available training images against using a subset by simple random sampling (SRS) or systematic sampling on the ORL database is shown in Figure 2; the corresponding comparison on the UMIST database is shown in Figure 3. The randomization is on the division into training set (gallery), testing set (probe) and imposter set. Since the ORL and UMIST databases are relatively small, the performance using all available images is better and more stable than using a subset; however, when the training set is huge, the problems listed above will occur. Since the randomization here is on the division of the training set for both schemes, the two sampling schemes can be compared on an equal footing. From Figures 2 and 3, it is noticeable that the SRS points spread much more than the systematic sampling points, and SRS tends to have worse false acceptance rate (FAR) and false rejection rate (FRR). Meanwhile, though the spread of the ROC “mist” for systematic sampling is a little wider than when using all training images, their coverage is comparable. Considering the comparable performance of systematic sampling and using all available training images, and the savings in calculation by systematic sampling, the systematic sampling scheme is preferred in constructing the training set.
Figure 2. ROC “mist” comparison of using all and a subset for training on the ORL database (FAR versus FRR; legend: All, Subset by SRS, Subset by Systematic).
Figure 3. ROC “mist” comparison of using all and a subset for training on the UMIST database (same axes and legend).
5.2. Comparison of SRS and Systematic Sampling in Practical Setting
Given N available images for the training set, a subset of n (< N) images constitutes a sample under simple random sampling (SRS); thus there are $\binom{N}{n}$ elements in the sample space. Trading off between the sample space size and the calculation efficiency, n is selected as N/2. The sample space size is then 1.3785 × 10^11 and 1.2641 × 10^14 for the ORL and UMIST databases respectively. These are still big numbers, so a Monte Carlo simulation is implemented by randomly selecting 1000 samples from the sample space and evaluating the FAR versus the FRR.
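As a quick sanity check of these counts (a sketch; the values N = 40 for ORL and N = 50 for UMIST are inferred here from the reported sample space sizes, not stated explicitly in the text):

```python
from math import comb

# binom(N, N/2) for the per-database training-set sizes; N values inferred
for name, N in [("ORL", 40), ("UMIST", 50)]:
    print(f"{name}: C({N}, {N // 2}) = {comb(N, N // 2):.4e}")
# ORL: C(40, 20) = 1.3785e+11
# UMIST: C(50, 25) = 1.2641e+14
```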
Corresponding to n = N/2, the systematic sampling takes the interval S = 2, so there are only 2 elements in its sample space, which would not reveal much about the randomness of systematic sampling. Therefore, instead of fixing the N available images as in SRS, the N available images are varied, which corresponds to varying the division into the sets of registered users and imposters. In practice there is no way to predict who will be a registered user, so this randomization is justified. Such randomization is implemented 100 times, also in the spirit of Monte Carlo simulation. The comparison of the practical ROC “mist” on the ORL database is shown in Figure 4, and on the UMIST database in Figure 5. The pattern for both databases looks similar, but since the UMIST database contains more pose variation than the ORL database, when the same face recognition algorithm is applied, the FAR is higher on UMIST than on ORL, and the FRR also increases a little from ORL to UMIST. The “vertical line” pattern of the ROC points is due to the limited number of face images in the subset of the training set: FRR and FAR are the ratios of falsely rejected or falsely accepted face images over the total numbers of registered user images or imposter images, so FRR and FAR can only take a limited set of fractional values.
Figure 4. ROC “mist” by sampling design on the ORL database (FAR versus FRR; legend: SRS, Systematic).
Figure 5. ROC “mist” by sampling design on the UMIST database (same axes and legend).
Comparing SRS and systematic sampling on both databases, systematic sampling tends to have smaller FAR, though sometimes a slightly higher FRR, and the variation of the systematic sampling performance is relatively smaller. This can be explained by the way the images are collected and stored in the database. Since the directory structure of a file system is usually ordered by name, and the same subject shares the same prefix in the file names, a systematic sample covers the images very uniformly. When the face images are sampled uniformly, every subject can influence the final features to some extent, so the templates constructed in this way are more accurate than templates built without any information from certain subjects. Therefore the FAR is lowered. However, FAR and FRR are two conflicting metrics, and sometimes the FRR increases a little. Overall, systematic sampling seems to work better than simple random sampling for face databases with name-ordered directories. The ROC “mist” of SRS and systematic sampling appeared both in section 5.1 and in this section 5.2; the difference is in the randomization of SRS, while the randomization for the systematic sampling is the same. In section 5.1, the ROC “mist” of SRS scatters due to the variation in the division into the sets of registered users and imposters. In section 5.2, the ROC “mist” of SRS scatters only due to the huge sample space of SRS, the division of the sets being fixed.
5.3. Post-Stratification Experiment
Stratification needs external information to specify the strata exclusively and uniquely, and once the stratification standard is changed, the stratification has to be redone from scratch, since different stratification standards cannot be transformed into one another freely. Therefore the stratification is not done at the sampling stage; instead, post-stratification is implemented. Post-stratification is a method used to match the current subgroup distributions with those of population estimates from another, outside data source, which is usually supposed to be more reliable [24]. The natural choice of a stratification standard in face recognition is sex. The ORL database contains 40 subjects and the UMIST database contains 10 subjects, but there are only 4 females in each database. In practice, it is usually assumed that there are comparable numbers of female and male registered users, so the calibration proportion for both sexes is equal: P_0 = P_1 = 50% (0: female, 1: male). Depending on the actual selection of images, the pragmatic proportion of each stratum differs, and the pragmatic distribution usually differs from the assumed distribution, so re-weighting is needed to account for this difference. In the survey setting where post-stratification is most widely used, the statistics of concern are usually additive quantities, so the weights can be applied to the statistics directly. In the face recognition setting, however, the features are linear combinations of raw images from both strata, so the weights cannot be applied to the features directly. The PCA derivation depends on each variable's variance; after the mean image is extracted, if there is no variance normalization, the image intensity reflects the variance. Therefore the weights can be applied to the mean-extracted vectorized face images to influence their variance. Suppose that the pragmatic proportions of the female and male subject images are r_0 and r_1 respectively; then the weight on each stratum is P_i / r_i, i ∈ {0, 1}. For a given randomized division into the sets of registered users and imposters, experiments show that even though the final features, the coordinates of each image in the feature space, and the templates differ with or without post-stratification, the similarity scores come out identical either way, so the recognition rate does not change either. The experiment shows that stratification by sex is not significant in improving the recognition performance, and that the given face recognition algorithm does not depend on the relative intensity of each individual image.
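A minimal sketch of the re-weighting just described (the function and argument names are illustrative; the 50/50 calibration proportions follow the text):

```python
import numpy as np

def poststratify(X, sexes, target={0: 0.5, 1: 0.5}):
    """Re-weight mean-centered image vectors so each sex stratum matches
    the assumed population proportion P_i (0: female, 1: male).

    Each image receives the weight P_i / r_i, where r_i is the pragmatic
    (observed) proportion of stratum i among the training images.
    """
    sexes = np.asarray(sexes)
    Xc = X - X.mean(axis=0)                       # extract the mean image first
    r = {i: np.mean(sexes == i) for i in target}  # pragmatic proportions r_i
    w = np.array([target[i] / r[i] for i in sexes])
    return Xc * w[:, None]                        # weighting scales each image's variance
```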
6. CONCLUSION
This paper discussed random sampling design methods for face recognition, addressing how an informative subset of training images should be selected. It also emphasized the random nature of practical results through the ROC “mist”. Based on the simulations on the ORL and UMIST databases, the systematic sampling scheme is shown to achieve smaller FAR, with comparable FRR and less variation, than the simple random sampling (SRS) scheme, and the systematic sampling subset is comparable to using all available training images in recognition performance; the savings in system resources therefore make systematic sampling preferable. However, post-stratification by sex is not shown to be significant in improving the recognition performance. In the future, experiments can be implemented on bigger databases, with more sampling schemes and more stratification standards. Other methods to calibrate the weights in post-stratification should also be explored. Given the current results, the proposed sampling design can help exert the full potential of any face recognition algorithm without overhaul.
REFERENCES
1. B. Heisele and T. Koshizen, “Components for face recognition,” in Proceedings of the 6th International Conference on Automatic Face and Gesture Recognition, pp. 153–158, (Seoul, Korea), 2004.
2. J. Beveridge, K. She, B. Draper, and G. Givens, “A nonparametric statistical comparison of principal component and linear discriminant subspaces for face recognition,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 535–542, (Kauai, HI, USA), Dec. 2001.
3. B. Heisele, P. Ho, J. Wu, and T. Poggio, “Face recognition: component-based versus global approaches,” Comput. Vis. Image Underst. 91(1-2), pp. 6–21, 2003.
4. R. Gross, S. Baker, I. Matthews, and T. Kanade, “Face recognition across pose and illumination,” in Handbook of Face Recognition, S. Z. Li and A. K. Jain, eds., Springer-Verlag, June 2004.
5. J. Lu, K. Plataniotis, and A. Venetsanopoulos, “Regularized discriminant analysis for the small sample size problem in face recognition,” Pattern Recognition Letters 24, pp. 3079–3087, December 2003.
6. H. Rowley, S. Baluja, and T. Kanade, “Neural network-based face detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence 20, pp. 23–38, January 1998.
7. R. Xiao, M. Li, and H. Zhang, “Robust multi-pose face detection in images,” IEEE Transactions on Circuits and Systems for Video Technology 14, pp. 31–41, January 2004.
8. S. Z. Li and Z. Zhang, “Floatboost learning and statistical face detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence 26, pp. 1112–1123, September 2004.
9. R. Brunelli and T. Poggio, “Face recognition: Features versus templates,” IEEE Transactions on Pattern Analysis and Machine Intelligence 15, pp. 1042–1052, October 1993.
10. A. Pentland, B. Moghaddam, and T. Starner, “View-based and modular eigenspaces for face recognition,” in Proc. of IEEE Conf. on Computer Vision and Pattern Recognition (CVPR’94), (Seattle, WA), June 1994.
11. R. H. Green, Sampling Design and Statistical Methods for Environmental Biologists, John Wiley & Sons, New York, May 1979.
12. X. Lu and A. Jain, “Resampling for face recognition.”
13. X. Wang and X. Tang, “Random sampling LDA for face recognition,” in 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’04), 2, pp. 259–265, 2004.
14. H. Yu and J. Yang, “A direct LDA algorithm for high-dimensional data with application to face recognition,” Pattern Recognition 34, pp. 2067–2070, 2001.
15. L. Chen, H. Liao, M. Ko, J. Lin, and G. Yu, “A new LDA-based face recognition system which can solve the small sample size problem,” Pattern Recognition 33, pp. 1713–1726, 2000.
16. J. Lu, K. N. Plataniotis, and A. N. Venetsanopoulos, “Face recognition using LDA-based algorithms,” IEEE Transactions on Neural Networks 14, pp. 195–200, January 2003.
17. P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, “Eigenfaces vs. fisherfaces: Recognition using class specific linear projection,” IEEE Transactions on Pattern Analysis and Machine Intelligence 19, pp. 711–720, July 1997.
18. M. Turk and A. Pentland, “Eigenfaces for recognition,” Journal of Cognitive Neuroscience 3(1), pp. 71–86, 1991.
19. X. He, S. Yan, Y. Hu, P. Niyogi, and H.-J. Zhang, “Face recognition using laplacianfaces,” IEEE Transactions on Pattern Analysis and Machine Intelligence 27, pp. 328–340, March 2005.
20. J. Lu, Discriminant Learning for Face Recognition. PhD thesis, University of Toronto, 2004.
21. F. Samaria and A. Harter, “Parameterisation of a stochastic model for human face identification,” in 2nd IEEE Workshop on Applications of Computer Vision, (Sarasota, Florida), December 1994.
22. D. B. Graham and N. M. Allinson, “Characterizing virtual eigensignatures for general purpose face recognition,” in Face Recognition: From Theory to Applications, H. Wechsler, P. J. Phillips, V. Bruce, F. Fogelman-Soulie, and T. S. Huang, eds., NATO ASI Series F 163, pp. 446–456, Computer and Systems Sciences, 1998.
23. K. I. Diamantaras and S. Y. Kung, Principal Component Neural Networks: Theory and Applications, John Wiley & Sons, New York, NY, USA, 1996.
24. S. Cohen, R. DiGaetano, and H. Goksel, “Estimation procedures in the 1996 Medical Expenditure Panel Survey Household Component,” in MEPS Methodology Report No. 5, AHRQ Pub. No. 99-0027, (Rockville, Maryland: Agency for Health Care Policy and Research), 1999.