The 10th International Conference on Management of Emergent Digital EcoSystems (MEDES'18) in Tokyo Metropolitan University, Minami-Osawa campus, Tokyo, Japan, September 25-28, 2018
An IoT framework for face detection applied to a multimedia sensor system

Elie TAGNE FUTE
University of Buea
University of Dschang
[email protected]

Lionel Landry SOP DEFFO
University of Buea
University of Dschang
[email protected]

Emmanuel TONYE
University of Yaounde I
[email protected]
ABSTRACT
In recent years, we have witnessed a growing use of digital technology, leading to a demand for better procedures to process multimedia information. Consequently, we need to build a system enabling environmental monitoring by collecting data from sensing multimedia devices, including cameras and microphones: this is the notion of the Internet of Multimedia Things (IoMT). Among the variety of proposed applications we have pedestrian detection, people counting, and face detection/recognition. Even if results are impressive in general, and in face detection in particular, there are still difficulties in using them in real-time applications. That is why many specialists orient their research in this direction. We propose an IoMT framework for face detection that uses multimedia sensors. The proposed approach has three major contributions: firstly, we propose an IoMT model for face detection; secondly, we propose a joint convolutional neural network model reinforced by CReLU modules; and finally, we propose an implementation for real-time applications.
Categories and Subject Descriptors
Computer vision [Image processing]: Artificial neural networks—Intelligent networks
General Terms
Delphi theory
Keywords
IoMT, convolutional neural network, face detection, model, CReLU module
1. INTRODUCTION
A human brain is wired to do object detection and recognition automatically and instantly. Computers, however, are not capable of this kind of high-level generalization, so we need to teach them how to do each step in this process separately. Face detection, on the other hand, can be regarded as a specific case of object-class detection, where the task is to find the locations and sizes of all objects in an image that belong to a given class. Ever since the seminal work of Viola et al. [24], with some improvements [27], the boosted cascade with simple features has become the most popular and effective design for practical face detection. Most methods follow the boosted cascade framework with more advanced features, which help them construct a more accurate binary classifier at the expense of extra computation. In 2005, another popular method was invented, called Histogram of Oriented Gradients [16], or HOG for short. It is based on evaluating well-normalized local histograms of image gradient orientations in a dense grid. The basic idea is that local object appearance and shape can often be characterized rather well by the distribution of local intensity gradients or edge directions, even without precise knowledge of the corresponding gradient or edge positions. Then came the Convolutional Neural Network (CNN) applied to face detection: compared with the previous hand-crafted features, a CNN can automatically learn features that capture complex visual variations by leveraging a large amount of training data, and its testing phase can easily be parallelized on GPU cores for acceleration. This paper presents a framework for face detection applied to a multimedia sensor system and has three main contributions: firstly, we propose an IoMT model for face detection; secondly, we propose a joint convolutional neural network model reinforced by CReLU modules; and finally, we propose an implementation for real-time applications.
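The HOG idea described above can be illustrated in a few lines of NumPy. This is only a minimal sketch of the descriptor (the cell size, bin count, and toy image below are illustrative choices, not values from [16]):

```python
import numpy as np

def hog_cell_histograms(img, cell=8, bins=9):
    """Minimal HOG-style descriptor: per-cell histograms of gradient orientations."""
    gy, gx = np.gradient(img.astype(float))          # image gradients (rows, cols)
    mag = np.hypot(gx, gy)                           # gradient magnitude
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0     # unsigned orientation in [0, 180)
    h, w = img.shape
    H = np.zeros((h // cell, w // cell, bins))
    for i in range(h // cell):
        for j in range(w // cell):
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            a = ang[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            idx = (a / (180.0 / bins)).astype(int) % bins
            np.add.at(H[i, j], idx, m)               # magnitude-weighted vote per bin
    # per-cell L2 normalization makes the descriptor robust to illumination changes
    H /= np.linalg.norm(H, axis=2, keepdims=True) + 1e-9
    return H

img = np.tile(np.arange(16, dtype=float), (16, 1))   # toy 16x16 horizontal ramp
H = hog_cell_histograms(img, cell=8, bins=9)
print(H.shape)   # (2, 2, 9)
```

On the ramp image, all gradient energy falls into the 0-degree bin of every cell, which is exactly the "distribution of local intensity gradients" idea.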
The rest of the paper is organized as follows. Section 2 discusses previous work on face detection, especially work that uses neural networks. In Section 3, the details of the proposed framework are exposed. In Section 4, the training methodology is presented. Section 5 presents the implementation, with analysis and interpretation of the results. Finally, Section 6 concludes the work with an appraisal and perspectives for improvement.
2. RELATED WORK
Face detection techniques can be classified into two classes: the feature-based approach, which consists of low-level visual feature extraction (color, edges, etc.) followed by detection of face features (mouth, lips, nose, eyes, etc.) and their relative placements, and the image-based approach.
2.1 Features-based face detection
These approaches use facial features to construct their detection model. If we refer to the work of [9], these approaches can be regrouped into:
2.1.1 Low-level analysis detection
This segments the visual features using pixel properties, gray-scale levels and motion information. It can therefore be used in edge-based techniques, as in [4], where edge labels are matched to a face model for verification.
2.1.2 Features analysis
Here, information about the face is used to remove the ambiguity produced by low-level analysis. As Urvashi mentions in [1], this allows less prominent features to be hypothesized.
2.1.3 Active shape model
When we want to define actual physical and higher-level appearance, we use this model. Tim Cootes et al. [22] developed and used the model to interact with the local image, deforming to take the shape of the features. This approach mainly consists of looking in the image around each point for a better position for that point, then updating the model parameters to best match this newly found position.
2.2 Image-based techniques
The problem with the previous methods is that explicit modeling of facial features is troubled by the unpredictability of face appearance and environmental conditions. More robust techniques are therefore needed, capable of performing in unfriendly environments, such as detecting multiple faces in clutter-intensive backgrounds. The basic approach to recognizing face patterns is a training procedure that classifies examples into face and non-face prototype classes. Comparison between these classes and a 2D intensity array (hence the name image-based) extracted from an input image allows the decision on face existence to be made. Most image-based approaches apply a window scanning technique for detecting faces. The window scanning algorithm is in essence just an exhaustive search of the input image for possible face locations at all scales, but there are variations in the implementation of this algorithm in almost all image-based systems. They are classified into two main families, namely linear subspace methods and neural network methods.
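The window scanning technique described above can be sketched as a simple generator. The window size, step, and scale factor here are illustrative values, not ones prescribed by any of the cited systems:

```python
import numpy as np

def sliding_windows(img, win=19, step=4, scale=1.25):
    """Exhaustive multi-scale window scan, as used by image-based detectors.
    Yields (x, y, size) boxes in original-image coordinates."""
    h, w = img.shape
    size = float(win)
    while size <= min(h, w):
        s = int(size)
        stride = max(1, int(step * size / win))   # keep the stride proportional to scale
        for y in range(0, h - s + 1, stride):
            for x in range(0, w - s + 1, stride):
                yield (x, y, s)   # each box would be resized to win x win and classified
        size *= scale             # enlarge the window instead of shrinking the image

boxes = list(sliding_windows(np.zeros((38, 38)), win=19, step=19))
print(len(boxes))   # 7
```

Even on this tiny 38x38 image the scan produces several candidate boxes, which is why NMS-style post-processing (Section 3) is needed in practice.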
2.2.1 Linear subspace methods
These techniques generally refer to principal component analysis (PCA) [14], linear discriminant analysis (LDA) [26], and factor analysis (FA). Principal component analysis is a technique for reducing the dimensionality of large datasets, increasing interpretability while at the same time minimizing information loss. It does so by creating new uncorrelated variables that successively maximize variance. Linear discriminant analysis, on the other hand, is most commonly used as a dimensionality reduction technique in the pre-processing step for pattern-classification and machine learning applications. The goal is to project a dataset onto a lower-dimensional space with good class separability, in order to avoid over-fitting (the curse of dimensionality) and also to reduce computational costs.
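The PCA step described above can be sketched via the SVD of the centered data matrix. The toy 2-D dataset below is only for illustration:

```python
import numpy as np

def pca(X, k):
    """PCA via SVD: project n samples (rows of X) onto the k directions
    of maximal variance, as used by linear-subspace face methods."""
    mu = X.mean(axis=0)
    Xc = X - mu                                  # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                          # top-k principal directions
    explained = (S**2)[:k].sum() / (S**2).sum()  # fraction of variance kept
    return Xc @ components.T, components, explained

rng = np.random.default_rng(0)
# toy data: 200 points stretched along one axis, little variance along the other
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [0.0, 0.3]])
Z, comps, frac = pca(X, k=1)
print(Z.shape, frac)
```

Keeping one component retains almost all the variance here, which is the "minimizing information loss" property the text refers to.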
2.2.2 Statistical methods
Apart from linear subspace methods, there are several other statistical approaches to face detection: systems based on information theory, support vector machines, and the Bayes decision rule. Huang [3] proposed a system based on Kullback relative information, called the Kullback divergence, which is a non-negative measure of the difference between two probability density functions P_Xn and M_Xn for a random process Xn. It is based on an earlier work on maximum likelihood face detection [2]. Osuna et al. [8] proposed a support vector machine (SVM) [23] applied to face detection, which follows the same framework as the one developed by Sung and Poggio [21] that scans input images with a 19×19 window. Here an SVM with a 2nd-degree polynomial as a kernel function is trained with a decomposition algorithm that guarantees global optimality. Finally, Schneiderman and Kanade [19], [20] describe two face detectors based on the Bayes decision rule, presented as a likelihood ratio test using random variables: if the likelihood ratio is greater than the right-hand side, then it is decided that an object (a face) is present at the current location.
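The likelihood-ratio test behind the Bayes decision rule can be sketched with toy one-dimensional Gaussian class-conditional densities. The means, variances, and threshold below are hypothetical stand-ins for the appearance models of [19], [20]:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def is_face(x, lam=1.0):
    """Bayes decision rule as a likelihood-ratio test:
    declare a face when P(x | face) / P(x | non-face) > lam."""
    p_face = gaussian_pdf(x, mu=0.8, sigma=0.1)      # hypothetical face model
    p_nonface = gaussian_pdf(x, mu=0.2, sigma=0.2)   # hypothetical background model
    return p_face / p_nonface > lam

print(is_face(0.75), is_face(0.1))   # True False
```

Raising `lam` trades missed faces for fewer false alarms, which is the knob such detectors expose.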
2.2.3 Neural network methods
Neural network based techniques are popular for pattern recognition and are inspired by the human brain. The first advanced neural approach that reported results on a large, difficult dataset was by Rowley et al. [10]. Their system incorporates face knowledge in a retinally connected neural network designed to look at windows of 20×20 pixels. One of the more recent approaches is a convolutional neural network (CNN) based detection method proposed by Girshick et al. [17], which achieved the state-of-the-art result on VOC 2012. It follows the recognition-using-regions paradigm: it generates category-independent region proposals, extracts CNN features from the regions, and then applies class-specific classifiers to recognize the object category of the proposals. Other methods have followed: Chen et al. [5] propose to use shape-indexed features to jointly conduct face detection and face alignment, while Zhang et al. [25] and Park et al. [6] adopt the multi-resolution idea from general object detection. A more recent framework uses a deep cascaded multi-task architecture [7] which exploits the inherent correlation between detection and alignment to boost their performance. Our proposed model is mostly based on this last one. The models proposed in all the above methods are characterized by two phases, the training phase and the test phase. In the training phase the system learns how to recognize known features using specifically developed techniques. Once this is done, the model can be used to test on unknown data, letting the trained system detect the needed features automatically.
3. THE PROPOSED FRAMEWORK
The overall system consists of four different parts, going from acquisition to detection. This architecture is presented in Figure 1.

3.1 The Multimedia sensors
In this part, we find any sensor device capable of collecting video information in real time. This includes cameras, smartphones, computer cameras, Raspberry Pi cameras, and so on. It should be noted that a video is just a collection of images, commonly called frames; consequently the acquisition device needs to capture an acceptable number of frames per second, such as 30 fps for example.

Figure 1: Framework detection system.

3.2 Pre-processing Unit
Here all required pre-processing operations are performed. This includes image denoising, image resizing, image filtering, and so on. This is to ensure that the quality of the output result is acceptable; in addition, a bad image can lead to poor performance of the processing algorithm.

3.3 The Proposed model
As mentioned earlier, our model is based on the model proposed in [7] and is therefore constituted of three neural networks, namely P-Net-imp, R-Net and O-Net.

Figure 3: Proposed CReLU module.
3.3.1 The P-Net-imp neural network
This is a fully convolutional network, called Proposal Network (P-Net), used to obtain candidate facial windows and their bounding box regression vectors. The candidates are then calibrated based on the estimated bounding box regression vectors. Finally, a non-maximum suppression (NMS) is employed to merge highly overlapped candidates. The former architecture of this network is presented in Figure 2. As it is commonly known that the deeper the network, the more accurate the results, we increase the depth of the previous architecture by inserting a proposed version of the CReLU module (see Figure 3), which replaces the normal ReLU and PReLU activation functions. This is with the knowledge that CReLU increases the quality of the results, as shown in [11]. This gives rise to the model shown in Figure 4.
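The NMS step mentioned above can be sketched with the standard greedy algorithm. The IoU threshold and the toy boxes are illustrative, not the values used in our implementation:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    drop every remaining box that overlaps it above iou_thresh, repeat."""
    boxes = np.asarray(boxes, dtype=float)   # rows: (x1, y1, x2, y2)
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area = lambda b: (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
        iou = inter / (area(boxes[i:i+1])[0] + area(boxes[order[1:]]) - inter)
        order = order[1:][iou <= iou_thresh]   # drop heavily overlapped candidates
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
print(nms(boxes, scores=[0.9, 0.8, 0.7]))   # [0, 2]
```

The second box overlaps the first too much and is merged away, while the disjoint third box survives, exactly the "merge highly overlapped candidates" behavior P-Net relies on.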
3.3.2 The R-Net neural network
All candidates are fed to another CNN, called Refine Network (R-Net), which further rejects a large number of false candidates, performs calibration with bounding box regression, and conducts NMS. This is shown in Figure 5.
3.3.3 The O-Net neural network
This neural network (O-Net) is the same as the one in [7]; its aim is to identify face regions with more supervision. In particular, the network outputs five facial landmark positions. Figure 6 shows all the network components.
4. TRAINING METHODOLOGY
Let us recall that the proposed model performs the same tasks as the original model, which are face/non-face classification, bounding box regression, and facial landmark localization. We therefore use the same mathematical formulas for these tasks.
4.1 Face classification
Here, the learning objective is formulated as a two-class classification problem. For each sample x_i, we use the cross-entropy loss

L_i^det = -(y_i^det log(p_i) + (1 - y_i^det) log(1 - p_i)),

where p_i is the probability produced by the network that indicates sample x_i being a face, and y_i^det in {0, 1} denotes the ground-truth label.
4.2 Bounding box regression
Here a prediction of the offset between a window and its nearest ground truth (i.e., the bounding box's left, top, height, and width) is made. The learning objective is formulated as a regression problem with the Euclidean loss for each sample x_i:

L_i^box = || ŷ_i^box - y_i^box ||_2^2,

where ŷ_i^box is the regression target obtained from the network and y_i^box is the ground-truth coordinate. There are four coordinates, including left top, height and width, and thus y_i^box ∈ R^4.
4.3 Facial landmark localization
Similar to the bounding box regression task, facial landmark detection is formulated as a regression problem and we minimize the Euclidean loss:

L_i^landmark = || ŷ_i^landmark - y_i^landmark ||_2^2,

where ŷ_i^landmark is the facial landmark coordinate vector obtained from the network and y_i^landmark is the ground-truth coordinate for the i-th sample. There are five facial landmarks, including the left eye, right eye, nose, left mouth corner, and right mouth corner, and thus y_i^landmark ∈ R^10.
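The three objectives above can be written down directly. This NumPy sketch mirrors the formulas; the sample values are illustrative:

```python
import numpy as np

def face_cls_loss(p, y):
    """Cross-entropy loss L_det for face/non-face classification."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def box_loss(y_hat, y):
    """Euclidean loss L_box for bounding-box regression, targets in R^4."""
    return np.sum((np.asarray(y_hat) - np.asarray(y)) ** 2)

def landmark_loss(y_hat, y):
    """Euclidean loss L_landmark for the 5 facial landmarks, targets in R^10."""
    return np.sum((np.asarray(y_hat) - np.asarray(y)) ** 2)

print(round(float(face_cls_loss(0.9, 1)), 4))   # 0.1054: small for a confident, correct prediction
print(box_loss([0.1, 0.1, 0.5, 0.5], [0.0, 0.0, 0.5, 0.5]))
```

In the cascade, each sample contributes only the losses relevant to its task (e.g., background patches use only the classification term).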
4.4 P-Net-imp training phase
Before talking about the training phase, let us highlight a fact. It is important to notice that in the original methodology the training is done six times, that is, twice per neural network. At first a normal training is done for the P-Net neural network, using data collected from a dataset (WIDER FACE in our case). This stage then generates another set of data, which is used to retrain the same neural network. This process is called fine-tuning. However, this second stage takes a lot of time and needs a lot of resources, and the commonly used approaches require GPUs or cloud computing techniques. That is why we thought of proposing a model that avoids the second stage of training, since the results obtained from the first stage have a high level of accuracy and can therefore already be used in some applications. In this paper we started by modifying the first neural network (the P-Net network) to see its effect on the results, and we intend to do the same for the remaining networks (R-Net and O-Net) in future work. We therefore perform the P-Net training using the following steps:
4.4.1 Collect an appropriate data set
Here we select a dataset that contains labeled faces with their corresponding bounding boxes. For that we chose the WIDER FACE [18] dataset, for it is one of the most used datasets for bounding-box detection.

Figure 2: P-Net convolutional model.

Figure 4: Proposed P-Net-imp network.

Figure 5: R-Net convolutional model.

Figure 6: O-Net convolutional model.

Table 1: Parameters used in the training process

Parameters               Values
Number of iterations     120000
Initialization method    Xavier method
Propagation algorithm    Stochastic gradient descent (SGD)
Momentum                 0.9
Weight decay             0.0005
Batch size               100
Learning rate            0.0001
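The parameters in Table 1 correspond to the classic SGD update with momentum and L2 weight decay. A minimal sketch on a toy quadratic objective (the objective below is a stand-in for the network loss, not part of the paper):

```python
import numpy as np

# values from Table 1
lr, momentum, weight_decay = 0.0001, 0.9, 0.0005

def sgd_momentum_step(w, grad, v):
    """One SGD step with momentum and L2 weight decay (Caffe-style update)."""
    v = momentum * v - lr * (grad + weight_decay * w)  # velocity accumulates gradients
    return w + v, v

# toy objective: f(w) = ||w||^2 / 2, so grad = w
w = np.ones(3)
v = np.zeros(3)
for _ in range(1000):
    w, v = sgd_momentum_step(w, grad=w, v=v)
print(float(np.linalg.norm(w)))   # well below the starting norm of sqrt(3)
```

With momentum 0.9 the effective step is roughly lr / (1 - momentum), which is why such a small base learning rate still makes progress.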
4.4.2 Dividing images into categories
Here we divide our dataset into three categories (positive, part and negative images) based on the IoU criterion [15], which states that an image is positive if the intersection-over-union ratio is greater than a set maximum threshold, part if it is between the maximum and a set minimum threshold, and negative otherwise.
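The IoU-based split described above can be sketched as follows. The two thresholds are illustrative, since the exact values are not given here:

```python
def iou(a, b):
    """Intersection over Union of two boxes (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def categorize(window, gt, pos_thresh=0.65, part_thresh=0.4):
    """Label a candidate window against a ground-truth box, following the
    positive / part / negative split described above (thresholds illustrative)."""
    v = iou(window, gt)
    if v >= pos_thresh:
        return "positive"
    if v >= part_thresh:
        return "part"
    return "negative"

gt = (0, 0, 10, 10)
print(categorize((0, 0, 10, 10), gt),
      categorize((2, 2, 12, 12), gt),
      categorize((20, 20, 30, 30), gt))   # positive part negative
```

"Part" windows still carry useful bounding-box regression signal even though they are not clean positives, which is why they are kept as a separate category.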
Figure 7: Detected face using the original model. Source: FDDB dataset [12]
4.4.3 Training of the model
Here we train the proposed model using a neural network library such as TensorFlow, Theano, Caffe, Torch, etc. Since the idea was to propose a CPU-based method that does not depend on any GPU, we chose to use the CPU build of Caffe.
5. IMPLEMENTATION AND RESULTS ANALYSIS
We chose Caffe [13] to implement our solution. Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is developed by Berkeley AI Research (BAIR) and by community contributors. The choice is motivated by its expressive architecture, extensible code, speed, and community. It is written in C/C++ and has a Python interface. The parameters used to train our P-Net model are listed in Table 1. We took at random some images from the FDDB dataset [12] to test our model, which gave rise to the following results. Figure 7, Figure 9 and Figure 11 present the faces detected by the original model, where the P-Net has been trained with our proposed training methodology and the remaining networks (R-Net and O-Net) use the existing pre-trained models. On the other hand, Figure 8, Figure 10 and Figure 12 present the faces detected by our proposed model. It can easily be seen that the results of our model are more accurate, as it detects the whole face while the original model tends to detect just a part of the face. This shows how the CReLU modules enable good results to be obtained quickly. This is better observed on the loss curves (Figure 13, Figure 14, Figure 15 and Figure 16), where the loss reaches 0 in the proposed model, which is not the case in the existing model. That is why in the original method the authors fine-tuned their model with another training stage to improve the results, while in our model the results are already acceptable at this stage. Figures 13 to 16 show the evolution of losses during the training phase and the testing phase for the two models. The scheme is the same, that is, it increases and decreases. Nevertheless, there is a difference.
While in the original method the loss decreases to a minimum value greater than 0,
Figure 8: Detected face using the proposed model. Source: FDDB dataset [12]

Figure 9: Detected face using the original model. Source: FDDB dataset [12]
Figure 10: Detected face using the proposed model. Source: FDDB dataset [12]

Figure 13: Loss evolution during training of the original model

Figure 14: Loss evolution during training of the proposed model

Figure 11: Detected face using the original model. Source: FDDB dataset [12]
it reaches 0 in the proposed model. Since the aim of the training process is to minimize the loss, which is the difference between the expected results and the obtained results, we can therefore say that our model outperforms the existing one in terms of losses. This can clearly be seen in Figure 13, Figure 14, Figure 15 and Figure 16 respectively. During the training phase the minimum loss never reaches 0 for the original model (Figure 13), which is not the case for the proposed model (Figure 14): the second curve reaches 0, for it touches the x-axis many times during the whole training. The same observations can be made on Figure 15 and Figure 16: the curve never reaches 0 in the case of the original model but succeeds in doing so in the proposed model. The model can therefore be recommended for some neural network applications.
5.1 Application in real time
Figure 12: Detected face using the proposed model. Source: FDDB dataset [12]
In this section we present the execution of the framework in a real-time situation. Figure 17, Figure 20, Figure 18 and Figure 19 give some snapshots of that execution in real time, where the multimedia sensor used is the laptop camera. On the left we show the execution using the existing model with the previously mentioned training methodology, and on the right the execution using the proposed model with the same training
The 10th International Conference on Management of Emergent Digital EcoSystems (MEDES'18)in Tokyo Metropolitan University, Minami-Osawa campus, Tokyo/Japan, September, 25-28, 2018
Figure 15: Loss evolution during testing of the original model

Figure 17: Detection in real time
Figure 16: Loss evolution during testing of the proposed model

methodology and dataset. It can clearly be seen that the proposed model outperformed the existing one in terms of accuracy of the results. The face detected by the proposed model is more precise (Figure 17, Figure 20) compared to the face obtained from the original model. While in the first situation (Figure 17) the rectangle drawn in our case is thicker than the one obtained from the existing model, the second situation (Figure 20) shows how the detected face is cut at the level of the chin by the existing model, whereas the rectangle detected by our model covers the whole face. Furthermore, in many situations the original model does not detect the face when it is turned in some directions (Figure 18 and Figure 19), while the proposed model does. This shows once more the benefit of our proposed architecture.
Figure 18: Detection in real time
6. CONCLUSION
In this paper we presented a light joint convolutional neural network cascade model for face detection on CPU. The model is based on a multi-task convolutional neural network that jointly detects faces and landmark positions. The model has been reinforced by CReLU modules, and we showed that the results were better for the proposed model, which detects the target face exactly. For future work we intend to extend this improvement to the two remaining sub-models (R-Net and O-Net) so as to generate global training data for fine-tuning the model. Furthermore, since the generation of new data for fine-tuning the proposed model takes a lot of time (more than two months at a time), we now intend to use GPUs to speed up its execution.
Figure 19: Detection in real time

Figure 20: Detection in real time

7. REFERENCES
[1] P. Brimblecombe. Face detection using neural networks. H615-Meng, Electronic Engineering, School of Electronics and Physical Sciences, 2005.
[2] A. J. Colmenarez and T. S. Huang. Maximum likelihood face detection. IEEE Proc. of 2nd Int. Conf. on Automatic Face and Gesture Recognition, Vermont, pages 222-224, 1996.
[3] A. J. Colmenarez and T. S. Huang. Face detection with information-based maximum discrimination. IEEE Proc. of Int. Conf. on Computer Vision and Pattern Recognition, 6, 1997.
[4] I. Craw, H. Ellis, and J. Lishman. Automatic extraction of face features. Pattern Recognition, 1987.
[5] D. Chen, S. Ren, Y. Wei, X. Cao, and J. Sun. Joint cascade face detection and alignment. Computer Vision ECCV, 2014.
[6] D. Park, D. Ramanan, and C. Fowlkes. Multiresolution models for object detection. Computer Vision ECCV, 2010.
[7] D. Chen, S. Ren, Y. Wei, X. Cao, and J. Sun. Joint cascade face detection and alignment. European Conference on Computer Vision (ECCV), pages 109-122, 2014.
[8] E. Osuna, R. Freund, and F. Girosi. Training support vector machines: an application to face detection. IEEE Proc. of Int. Conf. on Computer Vision and Pattern Recognition, 6, 1997.
[9] E. Hjelmas and B. K. Low. Face detection: A survey. Computer Vision and Image Understanding, 83:236-274, April 2011.
[10] H. A. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. IEEE Trans. Pattern Anal. Mach. Intell.
[11] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li. FaceBoxes: A CPU real-time face detector with high accuracy. arXiv:1708.05234v2 [cs.CV], August 2017.
[12] V. Jain and E. G. Learned-Miller. FDDB: A benchmark for face detection in unconstrained settings. UMass Amherst Technical Report, 2010.
[13] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[14] I. T. Jolliffe and J. Cadima. Principal component analysis: a review and recent developments. Philosophical Transactions A: Mathematical, Physical and Engineering Sciences, April 2013.
[15] Md Atiqur Rahman and Y. Wang. Optimizing intersection-over-union in deep neural networks for image segmentation. International Symposium on Visual Computing, 2016.
[16] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. HAL-INRIA open archive, 2005.
[17] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv preprint, 2013.
[18] S. Yang, P. Luo, C. C. Loy, and X. Tang. WIDER FACE: A face detection benchmark. arXiv:1511.06523.
[19] H. Schneiderman and T. Kanade. Probabilistic modeling of local appearance and spatial relationships for object recognition. IEEE Conference on Computer Vision and Pattern Recognition, 6, 1998.
[20] H. Schneiderman and T. Kanade. A statistical model for 3D object detection applied to faces and cars. IEEE Conference on Computer Vision and Pattern Recognition, 2000.
[21] K.-K. Sung and T. Poggio. Example-based learning for view-based human face detection. IEEE Trans. Pattern Anal. Mach. Intelligence, 20, 1998.
[22] U. Bakshi and R. Singhal. A survey on face detection methods and feature extraction techniques of face recognition. International Journal of Emerging Trends and Technology in Computer Science (IJETTCS), 3(3), May 2014.
[23] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.
[24] P. A. Viola and M. J. Jones. Rapid object detection using a boosted cascade of simple features. Proceedings IEEE Conference on Computer Vision and Pattern Recognition, 2001.
[25] W. Zhang, G. Zelinsky, and D. Samaras. Real-time accurate object detection using multiple resolutions. Proc. IEEE International Conference on Computer Vision, 2007.
[26] J. Ye. Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems. Journal of Machine Learning Research, 6, April 2005.
[27] C. Zhang and Z. Zhang. A survey of recent advances in face detection. Technical Report MSR-TR-2010-66, 2010.