ISBN: 9978-989-8533-66-1 © 2017
CONVNET FEATURES FOR AGE ESTIMATION
Ali Maina Bukar and Hassan Ugail
Centre for Visual Computing, University of Bradford, Richmond Road, Bradford, BD7 1DP, UK
ABSTRACT Research in facial age estimation has been active for over a decade, owing to its numerous applications. Recently, convolutional neural networks (CNNs) have been used in an attempt to solve this age-old problem, and researchers have proposed various CNN architectures for the purpose. Unfortunately, most of the proposed techniques have been based on relatively ‘shallow’ networks. In this work, we leverage the capability of an off-the-shelf deep CNN model, namely the VGG-Face model, which has been trained on millions of face images. Interestingly, despite the simplicity of the approach, features extracted from the VGG-Face model, when reduced and fed into linear regressors, outperform most state-of-the-art CNNs on both the FGNET-AD and Morph II benchmark databases. Furthermore, rather than using only the last fully connected (FC) layer of the trained model, we evaluate the activations from different layers of the architecture. Our experiments show that generic features learnt from intermediate layer activations carry more ageing information than the FC layers. KEYWORDS Age estimation, ConvNet, Deep learning, Pretrained ConvNet, Partial least squares regression
1. INTRODUCTION The human face is an important biometric because it carries a vast amount of information about an individual. Conveniently, an image of the face is easy to capture, even when the subject of interest is uncooperative. Faces are used as cues for recognising identities (Parkhi et al. 2015), kinship (Xia et al. 2012), underlying emotions (Rahulamathavan et al. 2013) and even diseases (Cuendet et al. 2016). Most importantly, the structure of the human face is highly indicative of age (Fu et al. 2010). Research in automatic facial age estimation (AFAE) has been active for over a decade. Because age is a demographic attribute of the human face, AFAE has several real-life applications including demographic studies, multi-cue identification, access control, surveillance, targeted advertisements and human-computer interaction systems. Similar to face detection and recognition, facial age estimation is hindered by several factors such as head pose variation, occlusion, facial expressions, illumination variation and cluttered backgrounds, to mention but a few. It is also challenged by other internal and external factors including gender, genes, health and lifestyle. Hence, several approaches have been documented in the literature to circumvent these problems (Fu et al. 2010). Traditionally, age estimation has been achieved via a two-step procedure, consisting of feature extraction and pattern learning (Bukar et al. 2016; Huang 2009). As the initial step, feature extraction is the process of parameterising the face with a view to defining an efficient descriptor. Several researchers have focused on this step, devising numerous feature extraction methods including, but not limited to, anthropometric models, statistical models, local binary patterns (LBP), and histograms of gradients (HOG). Biologically inspired features (BIF), which convolve images with Gabor filters followed by pooling, have been among the most successful and widely used feature extraction techniques (Huang 2009; El Dib & El-Saban 2010; Weng et al. 2013). To be precise, the best performing estimation result on the FGNET-AD (Lanitis et al. 2002) benchmark database, reported by El Dib and El-Saban (2010), utilised BIF. For comprehensive reviews, the reader should refer to Fernandez et al. (2014) and Fu et al. (2010).
International Conferences Computer Graphics, Visualization, Computer Vision and Image Processing 2017 and Big Data Analytics, Data Mining and Computational Intelligence 2017
The second step to achieving age estimation is pattern learning, which is the automatic mapping of facial features to target ages. Generally, researchers approach age learning as a multi-class classification, regression, or ranking problem. Support vector machines (SVM) and support vector regression (SVR) are the two most commonly used algorithms for classification and regression, respectively. Other forms of regression used in the literature include linear regression (Bukar et al. 2015), quadratic regression (Lanitis et al. 2002), partial least squares variants (Guo & Mu 2013) and canonical correlation analysis (CCA) based methods (Guo & Mu 2013). Recent advances in convolutional neural networks (CNNs) have resulted in a major paradigm shift. Using CNNs, features are learned automatically, facilitating the building of systems that learn end to end. Hence, researchers have attempted to solve the problem of age estimation using CNNs. One of the earliest works is that of Wang et al. (2015), who used a 5-layer CNN to extract facial features. Their experiments on the FGNET-AD (Lanitis et al. 2002) and Morph (Ricanek & Tesafaye 2006) databases yielded good results. However, they were unable to outperform state-of-the-art algorithms, which could be due to the shallow nature of the architecture. Levi and Hassner (2015) proposed a six-layer CNN for age group classification. Niu et al. (2016) used a four-layer CNN to treat AFAE as an ordinal regression problem. Yi et al. (2014) segmented the face into patches that were fed into multi-scale 3-layer sub-networks; the outputs of the sub-networks were then aggregated using a final layer. A similar approach was used by T. Liu et al. (2016); however, instead of using 23 patches, they used 8 patches per face. Liu et al. (2015) fused regression and classification via a 22-layer deep CNN in order to perform apparent age estimation. With the exception of Wang et al. (2015), none of the CNN-based AFAE works mentioned above reported results on the FGNET-AD database. Some reported results on the Morph database, but there are discrepancies in the protocols deployed: some used the protocol suggested by Guo and Mu (2013), while the rest randomly used 80% of the data for training and 20% for testing. To this end, we propose to bridge the gap through a number of contributions. Since there are very deep CNN models that have been trained on millions of face images, we have chosen to avoid training from scratch. Rather, we use transfer learning to extract ageing features. A suitable dimensionality reduction algorithm is then used to reduce the size of the extracted features prior to age-pattern learning. We then conduct an extensive evaluation of our approach on the two benchmark databases, FGNET-AD and Morph II. Thus, this work is aimed at investigating the need to train a new CNN from scratch. Will the pretrained architectures that were carefully modelled on millions of images suffice for the problem of age estimation, or do we need to label more training images for better age estimation? We also want to explore, analyse and evaluate which layer of the existing model is most suitable for feature extraction. Furthermore, the effect of alignment, as well as data augmentation, is investigated.
2. BACKGROUND AND RELATED WORK In recent years, convolutional neural networks (ConvNets or CNNs) have had a great impact on the computer vision and machine learning fields due to their ability to learn complex features using nonlinear multi-layered architectures (LeCun et al. 2015). Although they originated in the early 1990s, ConvNets were forsaken by the research community due to the assumption that feature learning using gradient descent would always overfit due to local minima (LeCun et al. 2015). However, their remarkable success in the ImageNet competition of 2012 dispelled the negativity associated with them. Today, state-of-the-art deep models are used in almost all computer vision applications including, but not limited to, detection (Russakovsky et al. 2015), recognition (He et al. 2015), classification (Huang et al. 2016), and information retrieval (Zhong et al. 2016). Generally, there are three ways of deploying ConvNets: training a network from scratch, fine-tuning an existing model, or using off-the-shelf CNN features. The latter two approaches are referred to as transfer learning (Oquab et al. 2014). Training a ConvNet from scratch requires an enormous amount of data, often millions of images (Vedaldi & Lenc 2015). Fine-tuning involves transferring the weights of the first layers learned by a base network to a target network. The target network is then trained using a new dataset for a specific task, usually different from that of the base network.
Research has shown that ConvNets efficiently learn generic image features (LeCun et al. 2015; Azizpour et al. 2015). Thus, these features can be used directly with simple classifiers to solve computer vision problems. This involves removing the last output layer of a trained ConvNet and using the activations of the last fully connected layer as features. Hence, the ConvNet is used as a feature extractor instead of a classifier. This approach is known as off-the-shelf feature extraction and has been used by several researchers (Azizpour et al. 2015; Sharif Razavian et al. 2014; Zha et al. 2015) to achieve promising results. As a rule of thumb, researchers advise that it should be the first approach tried when solving a computer vision task (Azizpour et al. 2015). Interestingly, studies also show that for a dataset with a small number of images, the off-the-shelf feature extraction technique outperforms both fine-tuning and training a network from scratch (Athiwaratkun & Kang 2015). Despite the considerable research conducted using off-the-shelf features, the focus has been only on object classification, detection, segmentation, and instance retrieval. The technique has not been exhaustively applied to the problem of age estimation. As stated earlier, recent works on age estimation have concentrated on training ConvNets from scratch (Wang et al. 2015; Levi & Hassner 2015; Niu et al. 2016; Yi et al. 2014) and fine-tuning existing networks (Liu et al. 2015). Furthermore, we have noticed that, due to the nature of the problems studied in the past, all the reported work employed binary classifiers such as SVM and random forest after the feature extraction. Here we investigate the performance of ConvNet features when used alongside linear regression techniques. We have also observed that in the literature (Azizpour et al. 2015; Sharif Razavian et al. 2014), activations of the last fully connected layer are used. In this paper, we investigate other layers of the ConvNet to thoroughly understand the effect of different activations along the hierarchy. Additionally, researchers almost always use models such as AlexNet (Krizhevsky et al. 2012) and OverFeat (Sermanet et al. 2013), which were trained on the ImageNet dataset. In this work, we have chosen to use the VGG-Face descriptor, due to its depth and the similarity of its base dataset (i.e. human faces) to the data we are working with.
2.1 VGG-Face Model VGG-Face (Parkhi et al. 2015), developed at Oxford University's Visual Geometry Group (VGG), is an application of the very deep ConvNet architecture VGG-16 (Simonyan & Zisserman 2014) to faces. It was trained on a database of 2.6 million face images of 2622 unique identities, with up to a thousand instances of each subject. The model is configured to take a fixed-size 224 × 224 RGB image as input; as a form of pre-processing, the authors initially centre-normalise all the training images. The deep architecture is made of a stack of 13 convolutional layers with filters having a uniform receptive field of size 3 × 3 and a fixed convolution stride of 1 pixel. As shown in Figure 1, groups of these convolution layers are followed by five max-pooling layers. The stack of convolution layers is then followed by three fully connected layers: FC6, FC7 and FC8. The first two have 4096 channels, while FC8 has 2622 channels which are used to classify the 2622 identities. The model’s implementation also incorporates 2D alignment and triplet loss embedding. Parkhi et al. (2015) have shown that the model outperforms Google's FaceNet (Schroff et al. 2015) and DeepID (Sun et al. 2014) on YouTubeFaces (Wolf et al. 2011).
Figure 1. VGG-Face Model Architecture
3. METHOD Our approach is to use the activations of different layers of the pretrained VGG-Face model as deep features. The dimensionality of the resulting features is then reduced before regression is used for age estimation.
3.1 Feature Extraction Given an input image represented as a tensor $x \in \mathbb{R}^{H \times W \times C}$, where $H$ is the image height, $W$ is the width and $C$ the number of colour channels, and a pre-trained $L$-layered ConvNet expressed as a series of functions $f = f_L \circ \dots \circ f_1$, let $x_1, x_2, \dots, x_L$ be the outputs of each layer in the network. The output of the intermediate layer $l$ can be computed from the function $f_l$ and the learned weights $w_l$ via $x_l = f_l(x_{l-1}; w_l)$, with $x_0 = x$. In order to fully investigate and evaluate which layer yields optimum results, the activations of five layers of the VGG-Face model, namely the last two convolution layers (conv5_2, conv5_3), the last max-pool layer (pool5) and the first two fully connected layers (FC6 and FC7), are used as separate feature channels.
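As a rough check on the feature sizes involved, the sketch below derives the flattened length of each of the five feature channels from the VGG-16 layout described in Section 2.1. It assumes the standard VGG-16 conventions, which the text does not spell out: 3 × 3 convolutions with 1-pixel zero padding (so the spatial size is preserved) and 2 × 2 max-pooling with stride 2. The names `VGG16_BLOCKS`, `layer_shapes` and `feature_length` are illustrative only.

```python
# Sketch of the VGG-16 layout that VGG-Face follows (Section 2.1).
# Assumption: 3x3 convolutions use stride 1 with 1-pixel zero padding, so they
# keep the spatial size, and every max-pool is 2x2 with stride 2.

# (block name, number of conv layers, output channels); 13 conv layers in total
VGG16_BLOCKS = [("1", 2, 64), ("2", 2, 128), ("3", 3, 256), ("4", 3, 512), ("5", 3, 512)]


def layer_shapes(input_size=224):
    """Return {layer_name: (height, width, channels)} for each conv/pool layer."""
    shapes = {}
    size = input_size
    for name, n_convs, channels in VGG16_BLOCKS:
        for i in range(1, n_convs + 1):
            shapes[f"conv{name}_{i}"] = (size, size, channels)
        size //= 2  # the max-pool halves height and width
        shapes[f"pool{name}"] = (size, size, channels)
    return shapes


def feature_length(shape):
    """Length of the activation tensor once flattened into a feature vector."""
    height, width, channels = shape
    return height * width * channels


shapes = layer_shapes()
dims = {name: feature_length(shapes[name]) for name in ("conv5_2", "conv5_3", "pool5")}
dims["FC6"] = 4096  # fully connected layers: channel counts stated in Section 2.1
dims["FC7"] = 4096

for name, dim in dims.items():
    print(f"{name}: {dim} features")
```

Under these assumptions conv5_2 and conv5_3 each yield a 14 × 14 × 512 activation and pool5 a 7 × 7 × 512 activation, which is why the convolutional feature channels are so much larger than the 4096-dimensional FC6 and FC7 features.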
3.2 Dimensionality Reduction and Regression Due to the large dimensions of the extracted features, ranging from 4096 in FC7 to 100352 in conv5_2, there is a need to reduce the feature dimensions, thus removing redundant information. Moreover, it is a well-known fact that, for $n$ observations and $p$ features, the regression estimate is not well-defined when $p > n$. In the past, researchers (Azizpour et al. 2015) used principal component analysis (PCA) for dimensionality reduction. However, PCA only explores the internal structure of the predictor variables (features) without considering their relationship to the response variables. Hence, PCA most likely discards important discriminatory features. As such, we are using partial least squares regression (PLS) (Wold 1975) in this work. PLS reduces data dimensions by creating latent variables which capture directions of highest variance in the predictors (features), as well as the directions that best relate the response to the predictor variables. Thus, it conducts a simultaneous decomposition of data, thereby retaining the most discriminatory features. Moreover, the technique also performs linear regression using the computed latent variables. Let $X$ be an $n \times p$ matrix of extracted features, and $Y$ be an $n \times q$ matrix of response variables. PLS decomposes the two matrices into

$$X = TP^{\top} + E, \qquad Y = TQ^{\top} + F, \tag{1}$$

where $T$ is an $n \times d$ matrix of linear latent variables (scores), $P$ and $Q$ are loadings, while $E$ and $F$ are matrices of residuals. By solving an optimisation problem, the scores can be computed directly from the feature set $X$,

$$T = XW, \qquad W = [w_1, \dots, w_d], \tag{2}$$

where $w_k = \arg\max_{w} \left\| \operatorname{cov}(Xw, Y) \right\|^{2}$, such that $w^{\top}w = 1$ and $w^{\top}X^{\top}Xw_j = 0$, for $j = 1, \dots, k-1$. Furthermore, the PLS regression coefficient is defined as

$$B = WQ^{\top}. \tag{3}$$

Hence, the relationship between the predictor and response variable is formulated as

$$Y = XB + F. \tag{4}$$

In this work, $Y$ is an $n \times 1$ vector representing ages.
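To make the procedure concrete, the following is a minimal pure-Python sketch of PLS regression for a single response (ages), using a NIPALS-style deflation loop. It is an illustrative assumption about how such a regressor can be built, not a reproduction of the implementation used in the experiments; the function names `pls1_fit` and `pls1_predict` and the toy data are hypothetical. Each latent variable is a linear combination of the centred features, and the final regression coefficient is accumulated from the per-component weights and response loadings.

```python
def pls1_fit(X, y, n_components):
    """Fit single-response PLS; returns (feature means, response mean, coefficients)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xk = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centred, deflated per step
    yc = [v - y_mean for v in y]

    direct_weights, x_loadings = [], []
    coef = [0.0] * p
    for _ in range(n_components):
        # Weight vector: direction of maximal covariance between features and ages.
        w = [sum(Xk[i][j] * yc[i] for i in range(n)) for j in range(p)]
        norm = sum(v * v for v in w) ** 0.5
        if norm < 1e-12:  # nothing left to explain
            break
        w = [v / norm for v in w]
        # Scores (latent variable) and loadings for this component.
        t = [sum(Xk[i][j] * w[j] for j in range(p)) for i in range(n)]
        tt = sum(v * v for v in t)
        load = [sum(Xk[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(yc[i] * t[i] for i in range(n)) / tt
        # Re-express w so that the scores come directly from the undeflated X.
        r = w[:]
        for r_prev, p_prev in zip(direct_weights, x_loadings):
            proj = sum(a * b for a, b in zip(p_prev, w))
            r = [rv - proj * rp for rv, rp in zip(r, r_prev)]
        direct_weights.append(r)
        x_loadings.append(load)
        coef = [c + q * rv for c, rv in zip(coef, r)]
        # Deflate: remove the part of X explained by this component.
        Xk = [[Xk[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
    return x_mean, y_mean, coef


def pls1_predict(model, x):
    """Predict the response for one feature vector x."""
    x_mean, y_mean, coef = model
    return y_mean + sum((xv - m) * c for xv, m, c in zip(x, x_mean, coef))


# Toy usage: two features, ages generated by an exact linear rule.
X_toy = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 8.0]]
y_toy = [2 * a + 3 * b + 1 for a, b in X_toy]
toy_model = pls1_fit(X_toy, y_toy, n_components=2)
print([round(pls1_predict(toy_model, row), 3) for row in X_toy])
```

With as many components as the rank of the centred feature matrix, PLS coincides with ordinary least squares, so the toy targets are recovered; with far fewer components than features, as in the experiments, it acts as a supervised dimensionality reduction.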
4. EXPERIMENTS AND RESULTS In this section, we will discuss the procedure we used for an extensive evaluation of the proposed method of age estimation. All images used in our experiments were cropped to a size of 224 x 224 and a data pre-processing step was deployed; this will be discussed shortly. Finally, features were extracted and fed into a PLS-age-learner, after which results were fully evaluated and compared to state-of-the-art algorithms.
4.1 Evaluation Datasets Experiments were performed on two popular and publicly available benchmark databases, the FGNET-AD and Morph Album II datasets. The FGNET-AD consists of 1002 colour and grayscale images of 82 unique subjects from different races. Although the age distribution ranges between 0-69 years, over 700 images are within the 0 - 20 age range and therefore, the distribution of the data is highly skewed. The subjects display varying facial expressions and head poses. Other photographic variations include resolution, illumination and sharpness. Furthermore, all the images have been annotated with 68 landmarks. The majority of documented works have used FGNET-AD for evaluation as it is one of the earliest databases used for age estimation. Morph Album II is the largest publicly available longitudinal face database, consisting of 55,134 images of 13,000 individuals. The age distribution lies between 16 - 77 years, with a median age of 33 years. Each subject has up to 4 images which were collected within a period of 4 years. The database contains people from different ethnicities, with various head poses and facial expressions. The image quality has varying scale, rotation, and translation as well as illumination. Recently, most researchers have used this database for the evaluation and comparison of algorithms.
4.2 Data Splitting Protocol FGNET-AD was evaluated using the leave-one-person-out (LOPO) cross-validation method (Geng et al. 2007). This iterative procedure involves training the age estimation algorithm with the images of 81 subjects and testing with the images of the individual that was left out. Thus, by the end of all 82 folds, each subject will have been used for testing once; results are calculated over all the estimations. LOPO emulates a real-life scenario where an estimation algorithm is tested on images of a person it did not come across during training. The Morph II database was split using the protocol suggested by Guo & Mu (2013). The database is divided into three non-overlapping partitions: S1, S2, and S3 (Others). The algorithm was trained and tested twice. Firstly, S1 was used for training and a combination of S2 and S3 for testing. In a second run, S2 was used for training, while S1 and S3 were reserved for testing. Finally, the results of the two tests were averaged.
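The LOPO procedure can be sketched as a generator of train/test index splits grouped by subject. This is a minimal illustration under the assumption that each image carries a subject identifier; the function name `lopo_folds` and the toy identifiers are hypothetical.

```python
from collections import defaultdict


def lopo_folds(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject."""
    by_subject = defaultdict(list)
    for index, subject in enumerate(subject_ids):
        by_subject[subject].append(index)
    for subject in sorted(by_subject):
        test_idx = by_subject[subject]
        held_out = set(test_idx)
        # All images of every other subject form the training set.
        train_idx = [i for i in range(len(subject_ids)) if i not in held_out]
        yield subject, train_idx, test_idx


# Toy usage: six images belonging to three subjects.
subjects = ["s01", "s01", "s02", "s03", "s03", "s03"]
folds = list(lopo_folds(subjects))
for subject, train_idx, test_idx in folds:
    print(subject, "train:", train_idx, "test:", test_idx)
```

Because the split is made per subject rather than per image, no image of the test person is ever seen during training, and every image is used for testing exactly once across the folds.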
4.3 Image Pre-Processing The FGNET-AD images were aligned using the 68 landmark annotations provided with the dataset. Furthermore, the image backgrounds were removed so that only the face region remains (see Figure 2). We then conducted data augmentation during the training phase to compensate for the small size of the dataset. Each image was used to generate 7 additional images through random cropping and warping to the mean shape, as shown in Figure 2. For the Morph database, we used Zhu and Ramanan's (2012) algorithm to detect the faces and annotate them with 68 landmarks, after which the images were aligned.
Figure 2. Pre-processing of FGNET-AD images
4.4 Implementation of Age Estimation Utilising the procedure described in Section 3, five sets of estimations were conducted per experiment. Each estimation was performed by extracting features from one of five layers of the VGG-Face model: conv5_2, conv5_3, pool5, FC6, or FC7. Then, as proposed earlier, we deployed PLS for dimensionality reduction followed by regression. In all the experiments, the number of PLS latent variables was chosen via cross-validation. In order to evaluate the effect of image alignment, we conducted two sets of experiments on the Morph database, i.e. with and without alignment, denoted as wAlg and woAlg respectively.
4.5 Results Evaluation To evaluate the performance of our algorithm and procedures, two metrics were used throughout our experiments, the Mean Absolute Error (MAE) and Cumulative Score (CS), given by

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} \lvert y_i - \hat{y}_i \rvert, \tag{5}$$

$$\mathrm{CS}(j) = \frac{N_{e \le j}}{N} \times 100\%, \tag{6}$$

where $y_i$ is the ground truth age, $\hat{y}_i$ is the estimated age, $N$ is the number of test images, and $N_{e \le j}$ denotes the number of images on which the system makes an absolute error not higher than $j$ years. Initially, the performance of our five features was compared. As can be seen in Tables 1, 2 and 3 and Figure 3, features extracted using conv5_2 activations outperform the rest on both databases. The performance also degrades as we move higher along the hierarchy towards the FC7 layer. This suggests that the generic features learnt from intermediate layer activations carry more ageing information than the later layers, which are more specific to the problem of face identification. Table 3 also shows that image alignment improves the performance of the extracted off-the-shelf features. We have also observed that PLS has a remarkable dimensionality reduction capability, as it reduced thousands of features to a few latent variables (between 17 and 24).
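Both metrics are straightforward to compute from paired lists of true and estimated ages. The sketch below is illustrative only; the function names, the parameter `max_error_years` (the error threshold in years) and the toy ages are assumptions made for the example, not values from the experiments.

```python
def mean_absolute_error(true_ages, predicted_ages):
    """Average absolute difference between true and estimated ages, in years."""
    n = len(true_ages)
    return sum(abs(t - p) for t, p in zip(true_ages, predicted_ages)) / n


def cumulative_score(true_ages, predicted_ages, max_error_years):
    """Percentage of test images whose absolute error does not exceed the threshold."""
    n = len(true_ages)
    hits = sum(1 for t, p in zip(true_ages, predicted_ages) if abs(t - p) <= max_error_years)
    return 100.0 * hits / n


# Toy usage with made-up ages (absolute errors: 3, 5, 12 and 0 years).
true_ages = [20, 35, 50, 8]
pred_ages = [23, 30, 62, 8]
print(mean_absolute_error(true_ages, pred_ages))   # 5.0
print(cumulative_score(true_ages, pred_ages, 10))  # 75.0
```

MAE summarises the typical size of the error, while CS reports how many estimates fall within a tolerated error, so the two are usually reported together.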
Figure 3. Comparison of CS (a) FGNET-AD (b) Morph woAlg (c) Morph wAlg

Table 1. Evaluation of our extracted features on FGNET-AD
Layer      Latent Vars   MAE
conv5_2    18            2.70
conv5_3    18            2.83
pool5      18            2.97
FC6        18            3.89
FC7        18            5.51
Table 2. Evaluation of our features on Morph II woAlg
Layer      Latent Vars. (S1 / S2)   MAE (Tr. Set S1)   MAE (Tr. Set S2)   Avg. MAE   CS < 10 years
conv5_2    17 / 17                  3.93               3.91               3.92       96.71%
conv5_3    17 / 17                  3.95               3.93               3.94       96.61%
pool5      17 / 17                  4.06               4.03               4.05       96.06%
FC6        24 / 24                  4.33               4.29               4.31       94.32%
FC7        24 / 24                  4.50               4.51               4.51       93.26%
Table 3. Evaluation of extracted features on Morph II wAlg
Layer      Latent Vars. (S1 / S2)   MAE (Tr. Set S1)   MAE (Tr. Set S2)   Avg. MAE   CS < 10 years
conv5_2    17 / 17                  3.84               3.82               3.83       96.82%
conv5_3    17 / 17                  3.87               3.86               3.87       96.75%
pool5      17 / 17                  4.01               3.97               3.99       96.18%
FC6        24 / 24                  4.27               4.25               4.26       94.43%
FC7        24 / 24                  4.45               4.45               4.45       93.40%
Next, we compared the performance of our best results with state-of-the-art algorithms. As can be seen in Table 4, our results on FGNET-AD using data augmentation surpass what has been reported over the years.

Table 4. Comparison of our best MAE result to state-of-the-art algorithms on FGNET-AD
Method                               MAE
BIF (Guo et al. 2009)                4.77
C & H BIF (Han et al. 2013)          4.60
OHR (Chang et al. 2011)              4.48
LSR (Chao et al. 2013)               4.38
CNN (Wang et al. 2015)               4.22
BI. AAM (Hong et al. 2013)           4.18
EBIF (El Dib & El-Saban 2010)        3.17
Proposed                             2.70
This further demonstrates the power of ConvNets and their efficiency, especially after conducting meticulous pre-processing steps such as alignment, background removal and augmentation. The performance of our approach on the Morph database is also superior to most of the state-of-the-art algorithms, as can be seen in Table 5.

Table 5. Comparison of our best MAE result to state-of-the-art algorithms on Morph II database
Method                           MAE (Tr. Set S1)   MAE (Tr. Set S2)   Avg. MAE
FMBS (T.-J. Liu et al. 2016)     3.96               4.01               3.99
KCCA (Guo & Mu 2013)             4.00               3.95               3.98
KPLS (Guo & Mu 2011)             4.21               4.15               4.18
3-step (Guo & Mu 2010)           4.44               4.46               4.45
BIF (Guo et al. 2009)            5.06               5.12               5.09
Proposed                         3.80               3.76               3.83
5. CONCLUSION We have investigated the use of off-the-shelf ConvNet representations for the problem of age estimation. Using activations from different layers of the VGG-Face model, experiments were conducted on both the FGNET-AD and Morph Album II databases. With the simultaneous dimensionality reduction capability of partial least squares regression, we have demonstrated that promising results can be achieved without having to train a ConvNet from scratch specifically for age estimation. This is an interesting finding, especially given the challenge of annotating face ages. The experiments conducted in this work have shown that, in contrast to most researchers’ assumptions, the activations of the last fully connected layer may not be the best solution for all problems. Furthermore, we have demonstrated that data augmentation and alignment affect the performance of the ConvNet features. While our result on the Morph database was comparable to state-of-the-art algorithms, applying background removal, as well as data augmentation, to this database may yield even better results. We are optimistic that more off-the-shelf ConvNet models and powerful supervised dimensionality reduction algorithms (such as PLS) can be applied to other computer vision problems.
REFERENCES
Athiwaratkun, B. & Kang, K., 2015. Feature Representation in Convolutional Neural Networks. arXiv preprint arXiv:1507.02313.
Azizpour, H. et al., 2015. From generic to specific deep representations for visual recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. pp. 36–45.
Bukar, A.M., Ugail, H. & Connah, D., 2016. Automatic age and gender classification using supervised appearance model. J. Electron. Imaging, 25(6), pp.1–11.
Bukar, A.M., Ugail, H. & Connah, D., 2015. Individualised Model of Facial Age Synthesis Based on Constrained Regression. In Image Processing Theory, Tools and Applications (IPTA), 2015 5th International Conference on. IEEE, pp. 285–290.
Chang, K.Y., Chen, C.S. & Hung, Y.P., 2011. Ordinal hyperplanes ranker with cost sensitivities for age estimation. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp.585–592.
Chao, W., Liu, J. & Ding, J., 2013. Facial age estimation based on label-sensitive learning and age-oriented regression. Pattern Recognition, 46(3), pp.628–641.
Cuendet, G.L. et al., 2016. Facial Image Analysis for Fully Automatic Prediction of Difficult Endotracheal Intubation. IEEE Transactions on Biomedical Engineering, 63(2), pp.328–339.
El Dib, M.Y. & El-Saban, M., 2010. Human age estimation using enhanced bio-inspired features (EBIF). 2010 IEEE International Conference on Image Processing, pp.1589–1592. Available at: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5651440.
Fernandez, C., Huerta, I. & Prati, A., 2014. A Comparative Evaluation of Regression Learning Algorithms for Facial Age Estimation. FFER in conjunction with ICPR, in press. IEEE.
Fu, Y., Guo, G. & Huang, T., 2010. Age Synthesis and Estimation via Faces: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(11), pp.1955–1976.
Geng, X. et al., 2007. Automatic Age Estimation Based on Facial Aging Patterns. , 29(12), pp.2234–2240.
Guo, G. et al., 2009. Human age estimation using bio-inspired features. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, pp. 112–119.
Guo, G. & Mu, G., 2010. Human age estimation: What is the influence across race and gender? In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops. IEEE, pp. 71–78.
Guo, G. & Mu, G., 2013. Joint estimation of age, gender and ethnicity: CCA vs. PLS. In Automatic Face and Gesture Recognition (FG), 2013 10th IEEE International Conference and Workshops on. IEEE, pp. 1–6.
Guo, G. & Mu, G., 2011. Simultaneous dimensionality reduction and human age estimation via kernel partial least squares regression. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp.657–664.
Han, H. et al., 2013. Age Estimation from Face Images: Human vs. Machine Performance.
He, K. et al., 2015. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.
Hong, L. et al., 2013. A new biologically inspired active appearance model for face age estimation by using local ordinal ranking. In Proceedings of the Fifth International Conference on Internet Multimedia Computing and Service. ACM, pp. 327–330.
Huang, G., Liu, Z. & Weinberger, K.Q., 2016. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993.
Huang, T.S., 2009. Human age estimation using bio-inspired features. 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp.112–119. Available at: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5206681.
Krizhevsky, A., Sutskever, I. & Hinton, G.E., 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. pp. 1097–1105.
Lanitis, A., Taylor, C. & Cootes, T., 2002. Toward Automatic Simulation of Aging Effects on Face Images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(4), pp.442–455.
LeCun, Y., Bengio, Y. & Hinton, G., 2015. Deep learning. Nature, 521(7553), pp.436–444.
Levi, G. & Hassner, T., 2015. Age and Gender Classification using Convolutional Neural Networks. , pp.34–42.
Liu, T. et al., 2016. Age Estimation Based on Multi-Region Convolutional Neural Network. In Chinese Conference on Biometric Recognition. Springer, pp. 186–194.
Liu, T.-J. et al., 2016. Age estimation via fusion of multiple binary age grouping systems. In Image Processing (ICIP), 2016 IEEE International Conference on. IEEE, pp. 609–613.
Liu, X. et al., 2015. Agenet: Deeply learned regressor and classifier for robust apparent age estimation. In Proceedings of the IEEE International Conference on Computer Vision Workshops. pp. 16–24.
Niu, Z. et al., 2016. Ordinal regression with multiple output cnn for age estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4920–4928.
Oquab, M. et al., 2014. Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1717–1724.
Parkhi, O.M., Vedaldi, A. & Zisserman, A., 2015. Deep face recognition. In British Machine Vision Conference. p. 6.
Rahulamathavan, Y. et al., 2013. Facial expression recognition in the encrypted domain based on local fisher discriminant analysis. IEEE Transactions on Affective Computing, 4(1), pp.83–92.
Ricanek, K. & Tesafaye, T., 2006. Morph: A longitudinal image database of normal adult age-progression. In 7th International Conference on Automatic Face and Gesture Recognition (FGR06). IEEE, pp. 341–345.
Russakovsky, O. et al., 2015. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), pp.211–252.
Schroff, F., Kalenichenko, D. & Philbin, J., 2015. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 815–823.
Sermanet, P. et al., 2013. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229.
Sharif Razavian, A. et al., 2014. CNN features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. pp. 806–813.
Simonyan, K. & Zisserman, A., 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Sun, Y., Wang, X. & Tang, X., 2014. Deep learning face representation from predicting 10,000 classes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1891–1898.
Vedaldi, A. & Lenc, K., 2015. Matconvnet: Convolutional neural networks for matlab. In Proceedings of the 23rd ACM international conference on Multimedia. ACM, pp. 689–692.
Wang, X., Guo, R. & Kambhamettu, C., 2015. Deeply-learned feature for age estimation. In 2015 IEEE Winter Conference on Applications of Computer Vision. IEEE, pp. 534–541.
Weng, R. et al., 2013. Multi-feature ordinal ranking for facial age estimation. In Automatic Face and Gesture Recognition (FG), 2013 10th IEEE International Conference and Workshops on. IEEE, pp. 1–6.
Wold, H., 1975. Quantitative sociology: international perspectives on mathematical and statistical model building, chapter path models with latent variables: the NiPALS Approach., Academic, London.
Wolf, L., Hassner, T. & Maoz, I., 2011. Face recognition in unconstrained videos with matched background similarity. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, pp. 529–534.
Xia, S. et al., 2012. Understanding kin relationships in a photo. IEEE Transactions on Multimedia, 14(4), pp.1046–1056.
Yi, D., Lei, Z. & Li, S.Z., 2014. Age estimation by multi-scale convolutional network. In Asian Conference on Computer Vision. Springer, pp. 144–158.
Zha, S. et al., 2015. Exploiting image-trained CNN architectures for unconstrained video classification. arXiv preprint arXiv:1503.04144.
Zhong, Y., Arandjelović, R. & Zisserman, A., 2016. Faces in places: Compound query retrieval. In BMVC-27th British Machine Vision Conference.
Zhu, X. & Ramanan, D., 2012. Face detection, pose estimation, and landmark localization in the wild. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, pp. 2879–2886.