Face Recognition Through Regret Minimization

by

Daniel Yule
B.Sc., University of Northern British Columbia, 2009

THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN MATHEMATICAL, COMPUTER, AND PHYSICAL SCIENCES (COMPUTER SCIENCE)

THE UNIVERSITY OF NORTHERN BRITISH COLUMBIA

May, 2011

© Daniel Yule, 2011

Abstract

Face recognition is an important problem for artificial intelligence researchers, with applications to law enforcement, medicine and entertainment. Many different approaches to the problem have been suggested; most can be categorized as either holistic or local. Recently, local approaches have shown some gains. This thesis presents a system for embedding a holistic algorithm into a local framework. The proposed system builds on the concept of Regional Voting to create Weighted Regional Voting, which divides the face images to be classified into regions, performs classification on each region, and finds the final classification through a weighted majority vote over the regions. Three different weighting schemes taken from the field of regret minimization are suggested, and their results compared. Weighted Regional Voting is shown to improve upon unweighted Regional Voting in every case, and to outperform or equal many modern face recognition algorithms.
Contents

Abstract
List of Tables
List of Figures
Acknowledgement
1 Introduction
  1.1 Overview
  1.2 Contribution
2 Literature Survey
  2.1 Holistic Approaches
    2.1.1 Principal Component Analysis
    2.1.2 Fisher's Linear Discriminant
    2.1.3 Spectral Regression
    2.1.4 Locality Preserving Projections
  2.2 Local Approaches
    2.2.1 Local Binary Pattern
    2.2.2 Volterrafaces
  2.3 Regional Voting
3 Proposed Algorithm
  3.1 Regret Minimization
  3.2 Estimating Regional Weights
4 Experiments
5 Analysis
6 Summary
Bibliography

List of Tables

4.1 Comparison of various face classifiers on the Yale database with two training subjects and 16 x 16 divisions
4.2 Comparison of various face classifiers on the Yale database with five training subjects and 16 x 16 divisions
4.3 Comparison of various face classifiers on the ORL database with two training subjects and 16 x 16 divisions
4.4 Comparison of various face classifiers on the ORL database with five training subjects and 16 x 16 divisions
4.5 Comparison of various face classifiers on the CMU PIE database with five training subjects and 16 x 16 divisions

List of Figures

2.1 Regional Voting Algorithm
3.1 Polynomial Weights Algorithm
3.2 Exponential Weights Algorithm
3.3 Exponential Weights Algorithm
3.4 Weighted Regional Voting Algorithm
4.1 Embedding various holistic classifiers in different sized regions on Yale with 2 training images and exponential weighting
4.2 Embedding various holistic classifiers in different sized regions on Yale with 5 training images and exponential weighting
4.3 Embedding various holistic classifiers in different sized regions on ORL with 2 training images and exponential weighting
4.4 Embedding various holistic classifiers in different sized regions on ORL with 5 training images and exponential weighting
4.5 Comparing various weighting schemes in different sized regions on Yale with 2 training images and SLPP as the regional classifier
4.6 Comparing various weighting schemes in different sized regions on Yale with 5 training images and SLPP as the regional classifier
4.7 Comparing various weighting schemes in different sized regions on ORL with 2 training images and SLPP as the regional classifier
4.8 Comparing various weighting schemes in different sized regions on ORL with 5 training images and SLPP as the regional classifier
4.9 Comparing various weighting schemes in different sized regions with best weights on Yale with 2 training images and SLPP as the regional classifier
4.10 Comparing various weighting schemes in different sized regions with best weights on Yale with 5 training images and SLPP as the regional classifier
4.11 Comparing various weighting schemes in different sized regions with best weights on ORL with 2 training images and SLPP as the regional classifier
4.12 Comparing various weighting schemes in different sized regions with best weights on ORL with 5 training images and SLPP as the regional classifier
5.1 Accuracy vs η
5.2 Regional Weights for Yale at 16 x 16
5.3 The weighting of regions using each of the three suggested weighting schemes

Acknowledgement

First off, I would like to thank my friends for continuing to be so, even as I ignored them to work on this thesis. Or while I pretended to work on this thesis, and instead played video games. I would like to thank my family, for all the encouraging (read: nagging) in this and everything else.
If I have accomplished anything, it is only because my parents gave me the tools to succeed. Naturally, I would like to thank my fiancée Darcy for reminding me that there are other things outside of this thesis, and also for putting up with a grumpy me. Lastly, and most importantly, I would like to thank my supervisor Dr. Liang Chen, who I somehow managed to fool many years ago into thinking I knew what I was talking about. His continued support and effort on my behalf, even when I wasn't sure of myself, has been absolutely amazing. I am absolutely convinced that I could not have a better supervisor, and my only regret about finishing this thesis is that it signals an end to my tenure as his graduate student. Dr. Chen, words cannot express how much I appreciate everything you've done.

Chapter 1
Introduction

1.1 Overview

Face recognition has become a very important and popular field of computer science. It has wide ranging applications, from smart homes [69] to entertainment [56] to security [51][42]. It is a problem that is easy to understand, and very difficult to solve. In 1966, the well known researcher Seymour Papert assigned an undergraduate student to work over the summer on solving the problem of computer vision, of which face recognition is a small part, thinking the student would be finished by the end of the summer. The student failed to accomplish much of anything, and a significant portion of artificial intelligence researchers have spent over 40 years attempting to solve that summer problem [13].

Face recognition, in the broadest possible terms, is the process of assigning an identity to an image of a face. However, face recognition is divided into two categories of problems: verification and identification. Verification is a one-to-one problem, i.e. it attempts to determine whether the image is of a particular person. Identification is a one-to-many problem, i.e. it attempts to find the identity of the person in the picture. Although the system presented here is one for identification, I feel it could easily be extended to the verification problem as well.

In order to carry out this task, a working knowledge of digital image processing is required. An image is assumed to be composed of pixels, each corresponding to a point of colour in the image. An image is then an h x w matrix of pixel values, where h is referred to as the height of the image and w as the width. Most face recognition systems in general, and the one outlined here in particular, assume that the image is grayscale. Thus, each pixel represents only the intensity of the gray, and can be represented by a single number. An image can therefore be assumed to be an h x w matrix of positive real valued numbers representing pixel intensities. Visual approaches attempt to perform classification based entirely on the pixels that comprise the image. Other approaches use other biometric data for recognition, such as range data [2][77][43], thermal imaging [5][84][14] or 3D data [44][71][6].

1.2 Contribution

The major contribution of this thesis is a new system for face recognition that is based on adding weights to the already proven system of Regional Voting. Regional Voting has been shown to be very stable in the face of a noisy system, and the weighting algorithms proposed here improve on the already best in class results of Regional Voting, in some cases cutting the error rate in half.
The remainder of the thesis is structured as follows: Chapter 2 contains an overview of the field of face recognition and a survey of pertinent literature. Chapter 3 gives a detailed description of the proposed algorithm. Chapter 4 details the experiments performed to verify the performance of the system. Chapter 5 analyzes the results of the experiments, and Chapter 6 summarizes the concepts presented and suggests some future directions.

Chapter 2
Literature Survey

Most humans have little difficulty with face recognition; the ability is mostly developed by the age of one year [62]. Thus, we as humans assume it must be an easy problem to solve: a computer should simply replicate the process that humans go through to recognize a face. However, psychological research has found strong indicators that face recognition is a specialized task [62]. Several case studies have been done on individuals who have lost the ability to recognize faces, but retain the ability to recognize all other classes of objects (for example, McNeil and Warrington [57] and Farah [27]). In fact, there is a marked decrease in the ability of humans to recognize faces when the faces are upside down [82], belong to other species of primates [68], or even belong to members of other ethnic groups [58]. Although there is some evidence to support the idea that the latter two cases are caused by a lack of exposure to non-members of the species or ethnic group, McNeil and Warrington [57][28] showed that a subject who is unable to distinguish faces when they are right side up is better at distinguishing them when they are upside down, indicating that right side up face recognition is a specialized process.

Thus, it seems face recognition is more than a matter of matching objects that are similar. Furthermore, given the lack of knowledge as to how precisely human face recognition works [62], it is very difficult to design an algorithm to replicate it. As the brain is made of hundreds of billions of neurons, it may also be that the mechanism the brain uses is infeasible for computational use. So, modern face recognition systems attempt to find different approaches to the face recognition problem.

In order to create a system that is capable of recognizing human faces, several steps must be handled:

1. Capture the image of a face
2. Digitize the image
3. Locate the face
4. Normalize the location of the face
5. Pre-process the face data
6. Perform face recognition

Although all the steps are important, for the system proposed here the first five steps are assumed to have already taken place. For an overview of each step, please see Gonzalez et al [33]. In particular, we assume that the images are grayscale, aligned based on pupil location, and cropped to a common size.

Even within the face recognition step, there are many problems to be dealt with. How can we deal with variation in lighting? How can we recognize that faces belong to the same person when they have different expressions? Can we be sure the system will still identify faces correctly when they are subject to normal cosmetic changes (facial hair change, glasses on or off), or when the face ages?

Generally speaking, the problem of face recognition has been approached in one of two ways: with holistic approaches or with local approaches [91]. Holistic approaches use the entire image as input data. Local approaches attempt to identify salient features of the face and perform recognition based on those. Each has its own advantages and disadvantages.
2.1 Holistic Approaches

Holistic approaches are so named because they incorporate the whole of the image data at once. In general, a holistic method treats the h x w matrix that represents the image as a vector of length hw. This vector can in turn be interpreted as a point in hw-dimensional space. Once the data is in vector form, any kind of pattern matching algorithm can be used, such as neural networks [59] or support vector machines [70]. Some approaches attempt to convert the image into a different domain, such as Gabor wavelets [74]. However, the most common holistic algorithms are based around dimension reduction.

A dimension reduction approach requires some training images for each face to be recognized. These are referred to as the gallery. Let $\mathcal{G} = \{g_1, g_2, g_3, \ldots, g_n\}$ be the gallery. Then our training data $X$ can be viewed as a matrix containing $n$ rows of $hw$-dimensional vectors:

$$X = \begin{bmatrix} g_1 \\ g_2 \\ g_3 \\ \vdots \\ g_n \end{bmatrix}$$

The first step for dimension reduction is to solve a generalized matrix problem $XP = B$, where $P$ is a linear projection function represented as an $hw \times k$ matrix, and $B$ is an $n \times k$ matrix corresponding to the basis of a reduced $k$-dimensional subspace of $X$. Each of the rows in $B$ is assumed to correspond to the identity of one of the images in $X$. If we have a new probe image $y$, we arrange it into a vector of length $hw$ and compute $y_p = yP$. Then, using some measure, the best match between $y_p$ and the rows of $B$ is found.

The idea is that an image contains redundant information. The process of projecting it into a lower dimensional space eliminates redundancy and highlights the important information carried by the image. This point will be illustrated through four examples of holistic matching that will be embedded in the proposed system: Principal Component Analysis, Fisher's Linear Discriminant, Spectral Regression and Locality Preserving Projections.

2.1.1 Principal Component Analysis

Principal Component Analysis (PCA) is a statistical technique for determining a basis for a vector space that accounts for as much of the variance within that space as possible [3]. This approach was first applied by Sirovich and Kirby to the problem of image compression [75, 46]. Shortly thereafter, Turk and Pentland [80] reasoned that if the important information in a facial image could be compressed to a smaller size, perhaps that compressed data could be used to classify the face. They named their approach Eigenfaces, after the nature of the calculation that is required.

Let $X$ be the training gallery, as defined above. The first step is to calculate the average face

$$\bar{g} = \frac{1}{n} \sum_{i=1}^{n} g_i$$

We then create a mean-centred version of $X$, labelled $X_m$:

$$X_m = \begin{bmatrix} g_1 - \bar{g} \\ g_2 - \bar{g} \\ \vdots \\ g_n - \bar{g} \end{bmatrix}$$

We then calculate the covariance matrix $C$ of $X_m$, which can be found to be

$$C = X_m' X_m \quad (2.1)$$

We are trying to maximize variance between data points. This can be formalized as attempting to find a $P$ that optimizes

$$P = \operatorname{argmax}_P \|P' C P\|$$

This $P$ can be found through eigenvectors, giving the approach its name. We find the eigenvectors $v_i$ and eigenvalues $\mu_i$ of the covariance matrix, which are the set of vectors and associated values satisfying

$$C v_i = \mu_i v_i$$

However, this calculation is very large, and very time and memory consuming, as it requires the computation of $hw$ eigenvectors; furthermore, at most $n$ of the eigenvalues will be non-trivial. Instead, we can calculate the eigenvectors $\alpha_i$ and eigenvalues $\beta_i$ associated with the matrix

$$D = X_m X_m' \quad (2.2)$$

of which there are only $n$. We arrange the $\alpha_i$ in decreasing order according to $\beta_i$. We can then recover the original eigenvectors $v_i$ by

$$v_i = X_m' \alpha_i$$

Using the notation introduced for the general case of dimension reduction, we let $k = n$, set $P = [v_1 \; v_2 \; \cdots \; v_n]$, and let $B = X_m P$.

In order to classify a new face image $y$, we simply find $y_m = y - \bar{g}$ and then compute $y_p = y_m P$. We then find the identity in $B$ most closely related to $y_p$. In their original paper, Turk and Pentland suggest treating the vectors as points in $n$-dimensional space and computing the simple Euclidean distance $\|y_p - b_i\|$ to each row $b_i$ of $B$; the classification is selected as the identity of the image that has the smallest Euclidean distance. This is the metric used in this thesis for this and the other holistic classifiers.
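To make the computation concrete, the following minimal sketch (Python with NumPy; the function names and the assumption that the gallery arrives as an n x hw array of flattened images are mine, not the thesis's) implements the small-matrix trick of equation (2.2) and the nearest neighbour classification described above:

    import numpy as np

    def train_eigenfaces(X, k=None):
        """X: n x hw gallery matrix, one flattened image per row."""
        g_bar = X.mean(axis=0)
        Xm = X - g_bar                        # mean-centred gallery
        D = Xm @ Xm.T                         # the n x n matrix of equation (2.2)
        betas, alphas = np.linalg.eigh(D)     # eigenvalues in ascending order
        order = np.argsort(betas)[::-1]       # arrange in decreasing order
        alphas = alphas[:, order]
        V = Xm.T @ alphas                     # recover the hw-dimensional eigenvectors
        V /= np.linalg.norm(V, axis=0) + 1e-12
        P = V if k is None else V[:, :k]
        B = Xm @ P                            # projected gallery
        return g_bar, P, B

    def classify(y, g_bar, P, B, labels):
        """Return the label of the gallery row nearest to the projected probe."""
        yp = (y - g_bar) @ P
        dists = np.linalg.norm(B - yp, axis=1)   # Euclidean distance to each row of B
        return labels[np.argmin(dists)]

Note that only the n x n matrix D is ever decomposed, which is what keeps the approach tractable when hw is several thousand.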
However, since the introduction of Eigenfaces, other techniques for selecting the closest identity have been suggested. In 1997, Moghaddam and Pentland [60] suggested using the principal components, along with other information from the image space, to create a probability distribution over the images in the gallery for each probe image. The classification can then be performed using a maximum likelihood estimator. At the time of its publication, this technique achieved the highest results on the FERET [64] dataset.

Li and Lu [50] propose that instead of simply finding the Euclidean distance between the probe image and each image in the gallery, a so-called "feature line" should be drawn between the points in space corresponding to each pair of images belonging to the same person. Then, instead of classifying based on the distance between images, the distance between the probe image and each feature line is used. Li and Lu reported that this approach performed 43.7% to 65.4% better than the standard Eigenface method on a custom dataset, and 81% better on the ORL [8] dataset.

It is clear from its long usage that PCA is a feasible approach to face classification. However, the principal components it finds only maximize the amount of variance between images. At no point does PCA take into account the similarities and differences between faces belonging to the same person. This suggests that another approach, one which includes information about identity in the training process, might show an advantage over PCA.

2.1.2 Fisher's Linear Discriminant

The Fisherface technique was developed in 1997 by Belhumeur et al [9] to address the limitations of PCA analysis. Since the principal components only maximize variance between faces, regardless of identity, the criteria used for maximization might be responding to environmental factors such as lighting change, differing expressions, eyewear or facial hair. The Fisherface approach finds components that separate the faces based on identity, thus ignoring the effects of environmental factors.

The Fisherface technique is based on Fisher's Linear Discriminant [29]. In Fisher's original 1936 paper, he suggests a technique for differentiating between members of two classes based on maximizing the ratio between the differences of the class means and the standard deviations within classes. More formally, let $X$ and $X_m$ be defined as in the Eigenface approach. However, instead of finding the covariance matrix and maximizing based on that, we find two matrices, $S_W$ and $S_B$, where $S_W$ is called the within-class scatter matrix and $S_B$ is called the between-class scatter matrix. We assume that we are classifying $p$ separate identities. Thus, we can find the covariance of the images within each class; we call the matrix for each class $S_c$, where $c$ is the identity of some person to be classified.
Using this we can calculate $S_W$ as

$$S_W = \frac{1}{n} \sum_{c=1}^{p} S_c$$

The between-class scatter matrix $S_B$ can be found by

$$S_B = \frac{1}{p} \sum_{c=1}^{p} (\bar{x}_c - \bar{x})(\bar{x}_c - \bar{x})'$$

where $\bar{x}_c$ is the mean for class $c$, and $\bar{x}$ is the overall mean of all training data. So, we are looking for a projection $P$ that maximizes the between-class scatter and minimizes the within-class scatter. Formally, we want a matrix $P$ such that

$$P = \operatorname{argmax}_P \frac{|P' S_B P|}{|P' S_W P|} \quad (2.3)$$

It has been shown [9] that if we let $P = [a_1 \; a_2 \; \cdots \; a_r]$, where $a_i$ corresponds to the $i$th generalized eigenvector solution of the equation

$$S_B a_i = \beta_i S_W a_i$$

for some eigenvalue $\beta_i$, then $P$ maximizes equation (2.3).

This general approach for finding a projection works quite well, but has a problem when applied to face recognition. The matrix $S_W$ is typically singular, as the number of classes $p$ is far less than the number of dimensions $hw$. Thus, a projection $P$ can be found that makes $|P' S_W P|$ exactly zero. To get around this problem, Belhumeur et al [9] suggest first projecting the data $X_m$ down to a smaller dimensional space using PCA, then reducing it further using Fisher's Linear Discriminant. So, we find our $P$ as follows:

$$P = P_{pca} P_{fld} \quad (2.4)$$

$$P_{pca} = \operatorname{argmax}_P |P' C P| \quad (2.5)$$

$$P_{fld} = \operatorname{argmax}_P \frac{|P' P_{pca}' S_B P_{pca} P|}{|P' P_{pca}' S_W P_{pca} P|} \quad (2.6)$$

where $C$ is found according to equation (2.1). The PCA projection can be configured to project into an $(n - p)$-dimensional space, thus circumventing the possibility of a singular $S_W$. The $P$ that is found will be of rank at most $p - 1$, so the dimensionality of the projection will be $p - 1$. Thus, if we have five classes, we will project into a four dimensional space; with twenty classes, we will have nineteen dimensions, and so on. Once we have found our $P$, classification proceeds precisely as it does with Eigenfaces: the original data is projected by $B = X_m P$ and then compared with probe images $y$ by projecting $y$ according to $yP$.

The approach outlined above is the Fisherface technique in its simplest form, and is the approach that is embedded into the proposed system.
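A minimal sketch of this two-stage projection, assuming the same flattened-gallery representation as before (the function name and the use of SciPy's generalized symmetric eigensolver are my choices, not the thesis's):

    import numpy as np
    from scipy.linalg import eigh

    def train_fisherfaces(X, labels):
        """X: n x hw gallery; labels: length-n array of identities."""
        n = X.shape[0]
        classes = np.unique(labels)
        p = len(classes)
        g_bar = X.mean(axis=0)
        Xm = X - g_bar
        # Stage 1: PCA down to n - p dimensions (equation (2.5)).
        _, _, Vt = np.linalg.svd(Xm, full_matrices=False)
        P_pca = Vt[: n - p].T
        Z = Xm @ P_pca
        # Stage 2: Fisher's Linear Discriminant in the reduced space.
        z_bar = Z.mean(axis=0)
        d = Z.shape[1]
        Sw = np.zeros((d, d))
        Sb = np.zeros((d, d))
        for c in classes:
            Zc = Z[labels == c]
            zc_bar = Zc.mean(axis=0)
            Sw += (Zc - zc_bar).T @ (Zc - zc_bar)   # within-class scatter
            diff = (zc_bar - z_bar)[:, None]
            Sb += diff @ diff.T                     # between-class scatter
        # Generalized eigenproblem Sb a = beta Sw a; keep the top p - 1 vectors.
        betas, A = eigh(Sb, Sw)
        P_fld = A[:, np.argsort(betas)[::-1][: p - 1]]
        P = P_pca @ P_fld                           # combined projection (2.4)
        return g_bar, P, Xm @ P

Classification then uses the same nearest neighbour rule as the Eigenface sketch above.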
However, there are many variations. Contemporaneous with Belhumeur et al, Etemad and Chellappa [26] proposed a technique for performing linear discriminant analysis for face recognition, but instead of using the Euclidean distance between points in the feature space, they calculate a weight for each dimension of the feature space, and find the distance between a probe image and the gallery images as a weighted sum of the distances along each dimension. They also suggest that their approach can be combined with a wavelet transformation to perform classification based on multiple criteria. Their experiments were performed on an augmented ORL [8] dataset and are quite impressive, achieving an error rate of only 1%. However, no comparison is made to other approaches, and it is unclear exactly how many images were used to train each individual, or whether the benefit was due to the pre-processing and augmentation applied to the dataset or to the algorithm itself.

Similar to the above is a more recent approach proposed by Yang et al [87] called "Fuzzy 2DFLD", which approaches the problem of face classification as a series of two-class problems. For each identity, there are two possible classes: either the face belongs to that person or it does not. The Fuzzy 2DFLD approach uses Fisher's linear discriminant for each classification problem. Then, the k nearest neighbours to each gallery image are found, and the number of these neighbours that belong to the given class is used as a class membership weight. The class membership is calculated for each gallery image for each class. These weights are then used to modify the scatter matrices used for Fisherface, so as to project probe images as closely as possible to the class centre. This approach provides a 33%-50% improvement over the standard Fisherface technique when applied to the Yale [85], ORL [8] or FERET [64] datasets.

In 2003, Bressan proposed a non-parametric extension of Fisherfaces [12], based on the observation that Fisher's linear discriminant makes an assumption of normality on the distribution of faces, which may not be valid. His experiments show that the non-parametric assumption does improve face recognition slightly, but not as much as it improves related fields such as letter recognition or gender classification. This may indicate that the assumption of normality within classes is not an invalid one.

Cai et al [16] suggest an approach that builds on Fisher's linear discriminant by adding unlabelled face data to the pre-labelled training data required by Fisherface. The concept is based on Regularized Discriminant Analysis (RDA) [31]. In RDA, the discrimination optimization of equation (2.3) is modified to

$$P = \operatorname{argmax}_P \frac{|P' S_B P|}{|P' S_W P + \alpha J(P)|} \quad (2.7)$$

where $\alpha$ is a scaling factor and $J(P)$ corresponds to the model error. Cai et al suggest using the $J(P)$ term to incorporate the structure of the unlabelled data. This approach yielded a significant benefit: although they do not compare their approach to traditional LDA, it shows a large improvement for face recognition over PCA when tested on the CMU PIE [81] dataset.

2.1.3 Spectral Regression

Fisherfaces is very effective at classifying multi-dimensional data, and has a high degree of success with face recognition. However, it suffers from at least two problems beyond those addressed by the variations listed above. The first is that it requires the calculation of eigenvalues not once but twice. This is a rather large computation, especially when performed on the entire covariance matrix, as is done for the PCA reduction step. The second problem is the PCA reduction step itself. Although not very much information is lost in the dimension reduction, some is, and it may be valuable to retain that information in the discriminant analysis phase.

Spectral Regression Discriminant Analysis (SRDA) [17, 18] is both a framework and an algorithm designed to reduce the computational complexity of not only LDA but also many other dimension reduction techniques. SRDA builds on the framework of Graph Embedding [86] to present a more efficient LDA algorithm.

Yan et al [86] propose the generalized framework of Graph Embedding for solving dimension reduction problems. The framework can handle linear, kernel and tensor reductions, although only the linear approach is treated here. The concept of graph embedding is to treat each image as a vertex in a graph. Thus, if there are $n$ images, there are $n$ vertices in the graph. The edges are represented by the $n \times n$ matrix of real numbers $W$, where the $(i,j)$th entry of $W$ is the (possibly zero) edge weight between vertex $i$ and vertex $j$. The purpose of the graph embedding model is to represent the vertices of the graph as vectors of dimension lower than $hw$. If $B$ is an $n \times k$ matrix that is the reduction of $X$ to a lower dimensionality $k$, where

$$B = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ \vdots \\ b_n \end{bmatrix}$$

then the optimal $B$ is given by minimizing

$$\sum_{i,j} (b_i - b_j)^2 W_{ij} \quad (2.8)$$

That is, if vertex $i$ and vertex $j$ have a large weight, then $b_i$ and $b_j$ should be close together.
It has been shown that these optimal $b$s can be found by finding solutions to

$$b = \operatorname{argmax}_b \frac{b' W b}{b' D b} \quad (2.9)$$

where $D$ is a diagonal matrix whose entries are column sums (that is, $D_{ii} = \sum_j W_{ji}$), and that the solutions to this equation are the eigenvector solutions to

$$W b = \lambda D b \quad (2.10)$$

However, since the translation from $X$ to $B$ will be linear, we know that $XP = B$. Thus, we can reformulate the optimization given in equation (2.9) as

$$p = \operatorname{argmax}_p \frac{p' X' W X p}{p' X' D X p} \quad (2.11)$$

and the associated eigenvector problem becomes

$$X' W X p = \lambda X' D X p \quad (2.12)$$

Cai et al [17] show that if $b$ is a solution to equation (2.10), then any $p$ satisfying

$$X p = b \quad (2.13)$$

will be a solution to the eigenvector problem in (2.12). Thus, if we can find $B$, we can find $P$: if we can determine the vectors $p$ that are solutions to (2.13), we can avoid having to solve equation (2.12). The advantage of avoiding equation (2.12) is that finding eigenvalues is a computationally expensive procedure and, furthermore, cannot be done if $X' D X$ is singular, which it is when the number of features in the image is larger than the number of images. So, solving equation (2.13) is more efficient, provided the eigenvectors are easier to find. Note that $D$ is always non-singular, so it will always be possible to perform this eigen decomposition.

This generalized framework can be made specific through the choice of $W$. Cai et al [17] provide examples of a $W$ for several linear dimension reduction strategies, including LDA [9], Locality Preserving Projection [38] and Neighbourhood Preserving Embedding [37]. In particular, $W$ for LDA can be determined as

$$W_{ij} = \begin{cases} 1/n_c & \text{if vertex } i \text{ and vertex } j \text{ both belong to the } c\text{th class} \\ 0 & \text{otherwise} \end{cases} \quad (2.14)$$

where $n_c$ is the number of samples from class $c$. From this, it is clear that $D = I$. We can assume without loss of generality that the images in $X$ are ordered according to identity. Then it is easy to see that $W$ is the block diagonal matrix given by

$$W = \begin{bmatrix} W^{(1)} & 0 & \cdots & 0 \\ 0 & W^{(2)} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & W^{(p)} \end{bmatrix}$$

where each $W^{(c)}$ is an $n_c \times n_c$ matrix. So, we can find the eigenvectors of $Wb = \lambda Db$ by finding the union of the eigenvectors of each of the blocks, and padding them with zeros to get the appropriate length. From examining equation (2.14) it can be seen that there will be an eigenvector $b = [1, 1, 1, \ldots, 1]'$ with associated eigenvalue 1. This is not a useful eigenvector, as the response of all data points is the same. So we take it along with the other $p - 1$ eigenvectors associated with non-zero eigenvalues and apply the Gram-Schmidt algorithm. Then, we can remove the all-ones eigenvector from the orthogonal basis, leaving us with a $(p - 1)$-dimensional basis, similar to the LDA approach outlined in Section 2.1.2.

If we let $B$ be the basis found through the application of the Gram-Schmidt process, then we can attempt to find a matrix $P$ such that $B = XP$. In particular, for each $b_i$ in $B$, we must find a $p_i$ such that

$$b_i = X p_i$$

However, such a $p_i$ may not exist. So we can approximate $p_i$ by finding the $p$ that minimizes

$$p = \operatorname{argmin}_p \sum_{j=1}^{n} (x_j p - b_{ij})^2 \quad (2.15)$$

where $b_{ij}$ is the $j$th element of vector $b_i$ and $x_j$ is the $j$th row of $X$. Equation (2.15) can be minimized by a least squares approximation technique, and there are many such algorithms that can handle large scale least squares problems very efficiently, such as LSQR [67]. As the number of faces $n$ is typically far smaller than the original dimensionality $hw$ of the images, the minimization problem in equation (2.15) is ill posed: there are infinitely many solutions. In fact, if we introduce a regularizing condition, then we can find only the projective functions that would have been found via the original dimension reduction problem, before being analysed through graph spectral analysis. The regularized least squares optimization problem is given by

$$p = \operatorname{argmin}_p \left( \sum_{j=1}^{n} (x_j p - b_{ij})^2 + \alpha \|p\|^2 \right) \quad (2.16)$$

where $\alpha$ is a shrinkage parameter.
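The whole pipeline for the LDA graph is short enough to sketch directly. The sketch below is my construction, following the steps above; it solves the ridge problem of equation (2.16) with a direct linear solve rather than LSQR, which is an implementation shortcut, not the thesis's choice:

    import numpy as np

    def srda(X, labels, alpha=0.01):
        """Spectral Regression for the LDA graph.
        X: n x hw gallery; labels: length-n array of identities."""
        n, hw = X.shape
        classes = np.unique(labels)
        # For the block-diagonal LDA graph of equation (2.14), the class
        # indicator vectors and the all-ones vector are eigenvectors of
        # W b = lambda D b (with D = I).
        Y = np.column_stack([np.ones(n)] +
                            [(labels == c).astype(float) for c in classes])
        # Gram-Schmidt via QR, then drop the all-ones direction.
        Q, _ = np.linalg.qr(Y)
        B = Q[:, 1:len(classes)]                  # p - 1 response vectors
        # Regularized least squares of equation (2.16), one solve per column.
        G = X.T @ X + alpha * np.eye(hw)
        P = np.linalg.solve(G, X.T @ B)           # hw x (p - 1) projection
        return P

Gallery and probe images are then projected by this P and matched by nearest neighbour, exactly as before.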
The advantage of this approach is that, while it is more complex and less direct, the number of computations performed is drastically reduced. Furthermore, there is no need to perform a reduction on the face image data prior to performing LDA analysis. Experimental results [17] have shown that Spectral Regression performs roughly on par with Regularized Discriminant Analysis [16], which outperforms LDA. However, Spectral Regression takes approximately 1/20th the time of RDA, implying that much more complex methodologies could be embedded in this framework, and even better results achieved, without too much of a time penalty.

2.1.4 Locality Preserving Projections

Another approach that uses the graph embedding model is Locality Preserving Projections (LPP) [38]. Building on the framework outlined above, the LPP graph is constructed by placing an edge between two vertices if they are "close." He et al [39] suggest two methods for determining closeness: k-nearest neighbours and ε-distance. That is, $W$ can be defined by either

$$W_{ij} = \begin{cases} \exp\left(-\frac{\|x_i - x_j\|^2}{\eta}\right) & \text{if } \|x_i - x_j\|^2 < \varepsilon \\ 0 & \text{otherwise} \end{cases} \quad (2.17)$$

or

$$W_{ij} = \begin{cases} \exp\left(-\frac{\|x_i - x_j\|^2}{\eta}\right) & \text{if } j \text{ is among the } k \text{ nearest neighbours of } i \\ 0 & \text{otherwise} \end{cases} \quad (2.18)$$

where $\eta \in \mathbb{R}$ is a tuning parameter. The goal is to minimize equation (2.8), which is

$$\sum_{i,j} (b_i - b_j)^2 W_{ij}$$

It can be shown that this equation is equivalent to

$$p' X' L X p \quad (2.19)$$

under the assumption that $B = XP$, where $L$ is the so-called Laplacian matrix of $W$. $L$ can be found by $L = D - W$, where $D$ is the diagonal matrix of column sums of $W$ (see Section 2.1.3).

Similar to the Spectral Regression approach, equation (2.19) can be minimized by solving the generalized eigenvector problem

$$X' L X p = \lambda X' D X p \quad (2.20)$$

So, the projection matrix $P$ is simply the collection of eigenvectors $p$ that satisfy equation (2.20). However, like LDA, because the number of pixels is so much higher than the number of face images available for training, $X' L X$ is usually singular, which means that equation (2.20) cannot be solved directly. To get around this problem, as in LDA, principal component analysis is first used to reduce the number of dimensions before attempting to solve the eigenproblem.

Locality Preserving Projections have been shown to be extremely effective at solving the face recognition problem. He et al [39] show a dramatic improvement over PCA and LDA for face recognition on the CMU PIE [81], Yale [85] and MSRA datasets.
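A sketch of the LPP construction follows, assuming (as the text requires) that X has already been PCA-reduced so that the matrices in equation (2.20) are nonsingular; the k-nearest-neighbour graph of equation (2.18) is used, and the function name is illustrative:

    import numpy as np
    from scipy.linalg import eigh

    def lpp(X, k=5, eta=1.0, n_components=20):
        """Locality Preserving Projections on a PCA-reduced gallery X."""
        n = X.shape[0]
        # Pairwise squared distances and the heat-kernel kNN graph (2.18).
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        W = np.zeros((n, n))
        for i in range(n):
            nbrs = np.argsort(d2[i])[1 : k + 1]       # skip i itself
            W[i, nbrs] = np.exp(-d2[i, nbrs] / eta)
        W = np.maximum(W, W.T)                        # symmetrize the graph
        D = np.diag(W.sum(axis=0))
        L = D - W                                     # graph Laplacian
        # Generalized eigenproblem (2.20); smallest eigenvalues first.
        vals, vecs = eigh(X.T @ L @ X, X.T @ D @ X)
        return vecs[:, :n_components]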
Since the publication of the Locality Preserving Projections paper, there have been numerous investigations into improving and applying LPP in various ways. In 2006, Cai et al [19] built on the idea of Locality Preserving Projection with Orthogonal Locality Preserving Projections (OLPP). OLPP employs a similar algorithm to standard LPP, constructing the same graph, minimizing the same function, and solving the same eigenproblem. However, once the eigenvectors have been found, only the eigenvector corresponding to the smallest eigenvalue is used; the algorithm then modifies the eigenproblem and re-solves it, finding another eigenvector which is orthogonal to the first. Again, the eigenvector corresponding to the smallest eigenvalue is selected, and the process is repeated k times, yielding k mutually orthogonal eigenvectors.

The OLPP algorithm shows a performance increase over standard LPP on the Yale [85], ORL [8] and CMU PIE [81] datasets; in some cases, OLPP cut the error rate of LPP in half. However, OLPP, like many dimension reduction techniques, is sensitive to changes in dimensionality, and unlike Fisherface, there is no way to determine what dimensionality should be used.

A separate technique for improving LPP, suggested independently by Whittier and Qi [83] and by Zheng et al [92], is Supervised Locality Preserving Projection (SLPP). The standard approach to LPP does not take class labels into account. However, the graph construction step can easily be modified to take this information into account: instead of placing an edge between two vertices only if they are close by some metric, an edge is placed between two vertices if they are in the same class, and optionally also if they are close. Zheng et al demonstrate SLPP showing a significant improvement over standard LPP [92]. However, the testing methodology described uses a custom face detection system, as well as Gabor wavelets for classification, so direct comparisons with previous results cannot be made. Nevertheless, the results demonstrate the effectiveness of using class labels in the training step.

Yu, Teng and Liu [88] suggest a similar idea called Discriminant Locality Preserving Projection (DLPP), but take the idea of including class labels a step further. They suggest modifying the objective function for LPP outlined in equation (2.8) to one inspired by a Fisher-type approach, namely

$$\frac{\sum_{c=1}^{p} \sum_{i,j=1}^{n_c} (b_i - b_j)^2 W^F_{ij}}{\sum_{i,j=1}^{p} (m_i - m_j)^2 W^E_{ij}} \quad (2.21)$$

where $p$ is the number of individuals in the database, $n_c$ is the number of images belonging to individual $c$, $m_i$ is the mean face of individual $i$, and $W^F$ and $W^E$ are within-class and between-class weightings respectively. This objective function can be solved by an eigenvector equation, and from there the algorithm proceeds as standard LPP. Although this algorithm is shown to have improvements over LPP, the improvements are not as dramatic as those demonstrated by either SLPP or OLPP. However, it is impossible to know whether this is because it is a weaker algorithm, or simply a result of different testing environments. It seems that DLPP should offer some improvement over SLPP, as DLPP not only minimizes within-class scatter, like SLPP, but also maximizes between-class scatter; the literature is not clear on this point.

2.2 Local Approaches

Unlike the holistic approaches described above, local approaches do not treat the entire image as a single pattern to be classified. Instead, they break the image up into smaller pieces and perform classification based on these. There are a wide variety of local approaches, which differ not only in how the face is broken up, but in how the pieces are compared and how the overall classification is determined from the classification of the features.

Hidden Markov Model (HMM) based approaches use some technique to extract information about the forehead, eyes, nose, mouth and chin, and then train a Hidden Markov Model [40] to recognize each individual. The HMM approach was first suggested by Samaria and Young in 1994 [72]. Their approach extracts strips of pixels corresponding to the important features, and trains the model based on these. Hidden Markov Models are designed for one-dimensional data; extending them to the two-dimensional case is NP-hard. Thus, the strips are presented as one-dimensional data, which loses the vertical spatial relationships.
Thus, the strips are presented as one dimensional 23 data, which loses the vertical spatial relationships. Nefian and Hayes [61] improved upon the speed of Samaria and Young's results by using a Discrete Cosine Transform on the strips of pixels corresponding to the original data. They achieve a slight improvement on the ORL [8] dataset, but see a more than lOx speed up of recognition rate. Later researchers have employed further variations on the concepts outlined here. Hu and Liu [41] suggest a Hidden Markov Model based around the Fast Fourier Transform and the Partial Least Squares approach. Bicego et al [10] suggest using Haar Wavelets[25] instead of DCT, and achieve similar results, but are able to improve on these results through a clever model selection process. In fact, based on their experiments, HMM + Haar Wavelets achieves 100% recognition accuracy on the ORL dataset. A popular method for finding features in a face is based on Scale Invariant Feature Transform (SIFT)[53]. SIFT, as its name implies, generates a large collection of features that are invariant to scale, rotation or translation. These features are extracted from each image, and compared. Images that have a large number of matching features are considered to belong to the same class. SIFT was originally suggested for general object recognition, but has since been applied to Face Recognition by several researchers, first by Mohamed Aly in a term paper[7]. Aly directly follows the approach laid out for general object recognition, and achieves better accuracy than Fisher or Eigenfaces. Bicego et al [11] and Luo et al[54] suggest grid and clustering approaches respectively. In essence, they suggest finding the features of local regions of the image, and classifying based on those. This not only improves the accuracy, but decreases the amount of computational power required to perform the calculation. Majumdar and Ward [55] suggest an approach inspired by Fisher's Linear Discriminant, 24 wherein features that have a high class discriminative power are selected. Geng and Jiang [32] outline two algorithms that modify the features themselves for improved accuracy. Another common approach uses textures to perform some form of matching[76] [52] . One of the most well known approaches to using textures for matching is Local Binary Pattern analysis. 2.2.1 Local Binary Pattern Local Binary Patterns (LPB) were first introduced by Ojala et al [66] in 1996, building on a previous idea proposed by Wang and He [36]. In its most basic form, LPB works as follows: A 3 x 3 window is created around each pixel (except the ones on the outside edge of the image). A pattern is generated starting with the top leftmost pixel, and proceeding clockwise around the centre pixel. If the current pixel is greater than the centre pixel, then the pattern is a 1, if less, then the pattern is a 0. Thus, there are eight binary digits (one for each non-centre pixel in the window), which when concatenated together form an 8-bit integer. This 8-bit integer then represents the pattern surrounding the central pixel. Although this approach shows some positive results for pattern classification, Ojala et al[65] propose some extensions to their original suggestion, to ensure that their classification is rotation and grayscale independent. Their suggestion is twofold. First, instead of using a rectangular 3 x 3 window, use a circular window. The points that do not lie in the centre of a pixel can be estimated using interpolation. 
Then, since the pattern should be invariant to the selection of the first pixel in the window, all patterns that are the same except for rotation are classified as the same. This leaves 36 distinct patterns. The second idea proposed in [65] comes out of 25 the observation that one type of pattern makes up about 90% of patterns observed in the data. This type of pattern, which they call a uniform pattern, is one in which there are two or fewer transitions from 0's to l's or l's to 0's. For example the patterns 00000000 and 00111000 are uniform, whereas the pattern 01100100 is not. Ojala et al suggest classifying all non-uniform patterns in the same class, thus using only uniform patterns for classification. Their justification is that the remaining 27 patterns don't appear often enough to learn anything useful about their probability. Once the patterns have been found, they are used to construct a histogram, which can then be compared to perform matching. However, Ahonen et al [4] suggest a further improvement for face recognition. LBP as it is provides excellent information about micro-structures, but carries no information at all about the spatial relationship of the data. The suggestion is to break the image up into regions, find a histogram for each region, and then concatenate the histograms from each region together in order to perform global matching. There several ways to compare histograms, including Histogram Intersection: D(S, M) = Y^ min{5,, r , Mhr) (2.22) i,r Log-likelihood: L(S, M) = Y s *,r log MJ>r (2.23) i,r and Chi squared statistic (x2) where 5 and M are the histograms to be compared and St,r is the frequency of pattern i in region r in histogram S. In their experiments, Ahonen et al found 26 that using a x2 statistic worked better than the other approaches, so that is the comparison that they used. The approach outlined above was tested on the FERET[64] dataset, and achieved a high accuracy rate compared to the Eigenfaces technique, in some cases halving the error rate. Like other approaches mentioned in this chapter, various improvements over the original strategy have been suggested. Zhang et al [89] convert the face classification problem into a two class problem by creating a classifier for each individual in the database, and classifying each image as either being of that individual or not. Then the Adaboost[30] algorithm learns the similarity between every face pair. This approach shows a competitive result on the FERET fa/fb partition. Zhang et al [90] suggest combining Local Binary Batterns with Gabor Wavelets. In their approach, each region is convolved with its Gabor filters, to generate so called Gabor Magnitude Pictures (40 for each region). These Gabor Magnitude Pictures are then subjected to Local Binary Pattern analysis, and classification proceeds as normal. The inclusion of Gabor filters dramatically improves on the performance of Local Binary Patterns when tested on the FERET[64] dataset. 2.2.2 Volterrafaces Volterrafaces is a very new technique, suggested by Kumar et al in 2009[48]. Although it could be used holistically, the authors apply it as a local approach, so that is what is followed here. Volterrafaces is based around the idea of the Volterra Series. 
Volterra series can completely describe any non-linear translation invariant func- 27 tional H : / / —>• H which maps the function x(t) to y(t) with oo „ oo $S(x(t)) = y{t) = ^2 oo „ / • • • / hn(TU ..., r n )x(p represents the convolution operator, and h(t) stands in for all the different orders of kernels. Under this approach, the goal is to approximate a functional H that can map images of faces to identities with a low error. We can evaluate the goodness of H at this task with a so called goodness functional defined as: l^ckecZ^m£ck.n(£ck IMP(x"i) ~ ^ P ( x n ) | | where c is the set of all identities. Using equation (2.27) we can rewrite the above as O(QP) = ^ C f c € c Eijecfc Hx' ®P K ~ X J ®P ^11 zlckec 22meck,nfck \\Xm ®p K - X.n ®p / 2 2Q, K\\2 where K is the Volterra Kernel that will be described shortly. The convolution of the image and the kernel is easiest done if it is linear. Thus, we 28 transform X first, and then perform a linear product with the kernel to approximate Qp. The image x, is represented in two dimensions. Then the transformed version of X, labelled A, is dependent on the order of the approximation. The Volterraface paper outlines first and second order approximations, and provides a framework for higher order approximations (although they are not usually necessary). For a first order approximation, X is transformed into the matrix A by finding a 6x6 neighbourhood of each pixel in X, vectorizing each neighbourhood, and stacking the results to form a new matrix A. For a second order approximation, not only is each pixel in the neighbourhood converted into a vector, but each neighbourhood's vector includes the product of each pixel in the neighbourhood with every other pixel. Now that we have transformed the input data, we can replace xt®pK = A%-K and so our objective function from equation (2.29) becomes v^ u \ ^ ) - v \\ A . K — A • K\\2 v \\A v - K - A • K\\* [ ' or [r J each region r, the result of H(pr) is found. The overall winner is taken to be the identity that has the most regional classifications. The algorithm is given in Figure 2.1. This approach is very straightforward, but it yields very powerful results. Chen and Tokuda[23] show that their approach improves the face recognition accuracy of any holistic approach that is embedded into it on the Yale [85], ORL [8] and CMU PIE[81] datasets, typically halving the error rate. Their experiments show that a region size of between 4 x 4 and 5 x 5 usually yields the best results. The reason suggested by Chen and Tokuda for the improvement is the increase in stability. They define stability as invariance to noise, and use two types of noise: uniform and concentrated. They have shown for both binary images [21] and grayscale using the Hamiltonian distance[22] that Regional Voting increases stability to both types of noise. The reason posited for the increase in stability is that Regional Voting is able to contain noise contamination to the regions affected, so that it takes widespread noise to change the classification of the system. 32 Chapter 3 Proposed Algorithm The proposed algorithm builds on the success of Regional Voting by applying weights to each region. The idea is motivated by the observation that there should be some regions of the face that are more important than others. This concept has been exploited as early as the 1970's in some of the earliest research into face recognition[45]. 
Chapter 3
Proposed Algorithm

The proposed algorithm builds on the success of Regional Voting by applying weights to each region. The idea is motivated by the observation that some regions of the face should be more important than others. This concept has been exploited as early as the 1970s in some of the earliest research into face recognition [45]. More recently, Adaptively Weighted Sub-Pattern PCA [78] as well as weighted local Gabor wavelets [93] have been suggested. The system at hand is distinct from these approaches in that the weights are estimated independently of any human knowledge of the structure of the face or of the underlying algorithm.

To modify the Regional Voting algorithm into a Weighted Regional Voting (WRV) algorithm, we assume that each region $r$ has an associated weight $w_r \in [0, 1]$. Then the final classification step in Figure 2.1 becomes

$$\operatorname{argmax}_i \left\{ \sum_{r \in \mathcal{R}} w_r \, eq(H_r(p), i) \right\}$$

where $eq(\cdot, \cdot)$ is 1 when its arguments are equal and 0 otherwise.

Although it does not deal specifically with face recognition, the paper "Combining Classifiers: A Theoretical Framework" by Kittler et al [47] provides a strong theoretical foundation for why a weighted vote provides the best estimation of the overall classification of several sub-classifiers. Under the assumption that the data provided to each classifier is independent (non-overlapping blocks) and each classifier selects one class for the input, a weighted vote is the maximum likelihood estimator for the output of the combination of the classifiers. In their paper, Kittler et al suggest doing a search over the entire state space to find the best weights for each classifier. Although this is possible, for the case of face recognition, if the image is divided 15 times horizontally and 15 times vertically, there are 225 weights to be estimated, and no suggestion as to how to find these weights. The main contribution of this thesis is the suggestion of several techniques for weight estimation. These weight estimation techniques are simple, and computationally feasible to implement.

A very simple approach to estimating the weights is to make them proportional to the accuracy of the system. That is, if $right(r)$ is the number of training images $H$ is able to classify correctly in region $r$, then we can simply use

$$w_r = right(r) / n \quad (3.1)$$

where $n$ is the total number of images in the gallery. However, there may be more sophisticated approaches to weight estimation. The concepts used are taken from regret minimization, so we take a brief detour before describing the weight estimation techniques.

3.1 Regret Minimization

Regret minimization is a technique borrowed from algorithmic game theory for making decisions in the face of uncertainty. Formally, regret minimization works as follows: let $A = \{a_1, a_2, \ldots, a_n\}$ be the set of allowable actions. At each discrete time $t$, we assume there exists a function $L^t : A \to [0, 1]$, called the loss function, which assigns a loss to each action that we can take. Our goal is then to select actions so as to minimize the amount of loss we suffer. There is a significant amount of theory surrounding this idea (see [20], [63] for example); here we distill only what we need.

It is fairly plain to see that selecting only one action all the time may not be the best approach. For example, when playing rock-paper-scissors, it is not a good strategy to always pick rock, but it is a good strategy to choose randomly between rock, paper and scissors. Thus, what we are looking for is a distribution over the actions, $p : A \to [0, 1]$, such that

$$\sum_{a \in A} p(a) = 1$$

It can be shown that the algorithms in Figures 3.1 and 3.2 minimize the expected loss, if the action at time $t$ is selected according to the distribution $p^t$.

Figure 3.1: Polynomial Weights Algorithm
  Require: $w_a^0 = 1$ and $p_a^0 = 1/|A|$ for each $a \in A$
  for each time t do
    for each action $a \in A$ do
      Let $w_a^t = w_a^{t-1} (1 - \eta L^{t-1}(a))$
      Let $p_a^t = w_a^t / W^t$, where $W^t = \sum_{a \in A} w_a^t$
    end for
  end for

Figure 3.2: Exponential Weights Algorithm
  Require: $w_a^0 = 1$ and $p_a^0 = 1/|A|$ for each $a \in A$
  for each time t do
    for each action $a \in A$ do
      Let $w_a^t = w_a^{t-1} e^{-\eta L^{t-1}(a)}$
      Let $p_a^t = w_a^t / W^t$, where $W^t = \sum_{a \in A} w_a^t$
    end for
  end for

So, $p_a^t$ is the amount of weight that distribution $p$ places on action $a$ at time $t$; $w_a^t$ is similar, but is not normalized. The parameter $\eta$ is a tuning parameter, typically between 0 and 0.1. See [20] for a derivation of this result and a discussion of how to find $\eta$.
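As a sketch, the Exponential Weights update of Figure 3.2 in Python (the loss array here is a placeholder input, and the function name is illustrative):

    import numpy as np

    def exponential_weights(losses, eta=0.05):
        """losses: T x |A| array, losses[t, a] = loss of action a at time t.
        Returns the sequence of distributions p^t over the actions."""
        T, n_actions = losses.shape
        w = np.ones(n_actions)
        history = [w / w.sum()]
        for t in range(T):
            w = w * np.exp(-eta * losses[t])   # multiplicative update
            history.append(w / w.sum())        # normalize to a distribution
        return np.array(history)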
To apply the above approach to face recognition requires a few changes and assumptions. Firstly, we do not have actions to take; instead, we have regions. Each "action" corresponds to selecting a particular region as the classifier. Secondly, there is no concept of time; each $t$ corresponds to a training image. Thirdly, the loss function is either 0 or 1: 0 if the classification is correct, 1 if it is not. Lastly, and most significantly, instead of selecting the classification based solely on one classifier, we will use $p$ as a weight on each region and sum the weighted votes of the regions to select the overall classification. Essentially, I am adopting the weighting for actions from regret minimization, and applying it to Regional Voting. This means that I have lost the theoretical guarantees of regret minimization, but have gained the stability of Regional Voting. I will show empirically that this trade-off is worthwhile.

3.2 Estimating Regional Weights

As outlined in Section 3.1, I make several changes to the ideas of decision theory in order to apply them to Regional Voting. I show the adaptation of the Exponential Weights algorithm, as the derivation for Polynomial Weights is similar. Making the changes outlined in Section 3.1 to the algorithm given in Figure 3.2 yields the algorithm given in Figure 3.3, where the function $L_r : \mathcal{G} \to \{0, 1\}$ is defined by $L_r(g_i) = 0$ if region $r$ of image $g_i$ is classified as the identity of image $g_i$, and $L_r(g_i) = 1$ otherwise.

Figure 3.3: Exponential Weights Algorithm
  Require: $w_r = 1$ for each region $r \in \mathcal{R}$
  for each gallery image $g_i \in \mathcal{G}$ do
    for each region $r \in \mathcal{R}$ do
      Let $w_r = w_r \, e^{-\eta L_r(g_i)}$
    end for
  end for

Examining the result of $w_r$ after updating for every image, we see that $w_r$ depends only on the number of training images that region $r$ misclassifies, which we write $wrong(r)$:

$$w_r = e^{-\eta \cdot wrong(r)} \quad (3.2)$$

Since there is no longer a time element, there is no need to normalize the weights into a distribution; instead we set $p(r) = w_r$. The same adaptation applied to the Polynomial Weights algorithm of Figure 3.1 gives

$$w_r = (1 - \eta)^{wrong(r)} \quad (3.3)$$

There are now three possible weighting approaches, given in equations (3.1), (3.2) and (3.3).

Although I have used the notations $wrong(r)$ and $right(r)$, I have given no indication as to how to calculate these. Since we are attempting to fit a model to our output, the standard approach is cross-validation. In particular, I use leave-one-out cross-validation to estimate the weights, as well as to find the optimal parameter $\eta$. The process is as follows. For each individual in the database, select one image as a probe image. Train the holistic classifier in each region using the remaining images. Then attempt to classify each of the probe images and store the results. Repeat the process by selecting a distinct set of probe images from the gallery, and continue repeating the process until all of the training images have been used exactly once. Then check the accuracy of each region at classifying the probe images; this is used as input for $wrong(r)$ and $right(r)$.

Cross-validation has the added bonus of allowing the system to estimate an optimal value for $\eta$. Once the estimated classifications for each image in each region are known, the weights can be estimated using different values of $\eta$. The $\eta$ that results in the best classification rate on the training data is the $\eta$ that is selected, and the corresponding weights are used for the final classification.
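A sketch of this weight and η estimation, assuming the leave-one-out predictions have already been collected (all names, the η grid, and the array layout are my assumptions):

    import numpy as np

    def weighted_vote(preds, w):
        """preds: length-R array of per-region predicted identities;
        w: length-R weights. Returns the identity with the largest
        weighted vote (the argmax of Section 3.2)."""
        ids = np.unique(preds)
        scores = [(w * (preds == i)).sum() for i in ids]
        return ids[int(np.argmax(scores))]

    def select_eta(region_preds, labels, eta_grid=np.linspace(0.01, 0.1, 10)):
        """region_preds: n x R array; region_preds[i, r] is the identity that
        region r assigned to held-out image i during cross-validation.
        labels: length-n NumPy array of true identities."""
        wrong = (region_preds != labels[:, None]).sum(axis=0)   # wrong(r)
        best = (None, None, -1.0)
        for eta in eta_grid:
            w = np.exp(-eta * wrong)                            # equation (3.2)
            acc = np.mean([weighted_vote(region_preds[i], w) == labels[i]
                           for i in range(len(labels))])
            if acc > best[2]:
                best = (eta, w, acc)
        return best[0], best[1]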
Any of the three weighting equations introduced can be used as part of the algorithm. I will call equation (3.1) Direct Proportional weighting, equation (3.2) Exponential Weighting and equation (3.3) Polynomial Weighting. Thus, the final overall algorithm can be seen in Figure 3.4. 38 Figure 3.4: Weighted Regional Voting Algorithm for each image g% G Q do Divide gt k times horizontally and £ times vertically end for Let 1Z be the set of regions that each image is broken into. for each region r G 71 do Train Hr on Qr for each image a g% G Q do Classify Hr(gt) Record the result end for Find wr using a weighting equation and the results above end for Classify the image by argmax < \ w' eq(Hr (p) a As a timesaver, instead of taking out all images one at a time, one sample from all identities can be removed, and processed simultaneously. 39 Chapter 4 Experiments Extensive experiments were carried out to validate the Weighted Regional Voting approach. For each dataset, the images were divided k times horizontally and k times vertically for k = 1,2, . . . 2 0 . Four different holistic classifiers were used: Eigenfaces[80] (see section 2.1.1), Fisher faces [9] (see section 2.1.2), SLPP[92](see section 2.1.4) and SRDA[17] (see section 2.1.3). Each of the holistic classifiers was trained on the gallery. Probe images were reduced using the linear reduction found during the training phase, and then classified by nearest neighbour classification. Each image to be classified was normalized by scaling the representative vector so that it had a unit sum. All gallery images were perturbed up to two pixels in each every direction, to account for possible misalignment issues. The closest match from among all the perturbed images was selected. All four dimension reduction techniques were embedded using code from Deng Cai's website[15]. For the Eigenface approach, the eigenvectors corresponding to the largest 80% of the eigenvalues were used to calculate the projection function. For the Fisherface approach, the data was first reduced using PCA, and then reduced using Fisher. For the SRDA approach, the a selected for regularization was a = 0.01. For 40 SLPP, cosine similarity was used to calculate the distances in the adjacency matrix. Again, the a selected for regularization was a = 0.01. In order to validate the Weighted Regional Voting approach, three different datasets were used: Yale database[85], the Olivetti Research Laboratory database[8] (ORL) and the Carnegie Melon University Pose, Illumination, and Expression database[81] (CMU PIE). The Yale dataset comes unsurprisingly from Yale University and contains 165 grayscale images in GIF format of 15 individuals. There are 11 images per subject, one per different facial expression or configuration: centre-light, with glasses, happy, left-light, with no glasses, normal, right-light, sad, sleepy, surprised, and wink[85]. The ORL dataset conies from the now defunct Cambridge AT&T Laboratory, formerly Olivetti Research Laboratory. In this database, there are ten different images of each of 40 distinct subjects. For some subjects, the images were taken at different times, varying the lighting, facial expressions (open / closed eyes, smiling / not smiling) and facial details (glasses / no glasses). All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position (with tolerance for some side movement) [8]. The CMU PIE database was taken at Carnegie Melon University between October and December 2000. 
There are 41, 368 images of 68 people. For each person, there are 13 different poses, 43 different illumination conditions and four different facial expressions [81]. The experiments performed here, however, used only the five frontal poses (C05, C07, C09, C27, C29), for a total of 170 images per individual, except for six indivudals with only 169 images each. The faces for all three datasets were normalized by manually finding the eye positions, scaling and translating the faces so that they were aligned on the eyes, and cropping the images to 64 x 64 pixels. This process was done prior to being 41 Figure 4.1: Embedding various holistic classifiers in different sized regions on Yale with 2 training images and exponential weighting Eigenfaces Fisherfaces SRDA SLPP 0 5 10 Divisions 15 20 downloaded from Deng Cai's website[15]. In the first set of experiments, conducted on the Yale and ORL datasets, each of the four holistic classification techniques mentioned above were embedded in the Weighted Regional Voting framework, with different numbers of regions, from l x l to 20 x 20. That is, each image was first left undivided, then divided twice vertically, and twice horizontally, and then thrice each direction and so on up to 20 divisons vertically and 20 horizontally. For these experiments, exponential weighting was used (see equation (3.2)), to demonstrate the validity of weighting in general. The weights were estimated using cross-validation as outlined above. Each algorithm is evaluated being split into training and testing data randomly, in 50 different ways for each number of regional divisions. The same 50 splits were used for each regional division The results for the Yale dataset for 2 and 5 training images are given in Figures 4.1 and 4.2 respectively. The results for the ORL dataset for 2 and 5 training images are given in Figures 4.3 and 4.4. Another set of experiments were run, to test the performance of the various sug- 42 Figure 4.2: Embedding various holistic classifiers in different sized regions on Yale with 5 training images and exponential weighting Eigenfaces Fisherfaces SRDA SLPP Divisions Figure 4.3: Embedding various holistic classifiers in different sized regions on ORL with 2 training images and exponential weighting Eigenfaces Fisherfaces SRDA SLPP 5 fO 15 Divisions 43 Figure 4.4: Embedding various holistic classifiers in different sized regions on ORL with 5 training images and exponential weighting - • - Eigenfaces - * - Fisherfaces -— SRDA -*SLPP 0 5 fO Divisions 15 20 gested weighting schemes across various numbers of divisions. All three suggested weighting schemes - exponential weighting (see equation (3.2)), polynomial weighting (see equation (3.3)) and directly proportional weighting (see equation (3.1)) as well as equal weighting were compared. Equal weighting corresponds to standard regional weighting[22] (see section 2.3). The weighting schemes were tested using by embedding SLPP. As above, each division was tested with the same 50 random splits for each dataset. The results for the Yale dataset for 2 and 5 training images are given in Figures 4.5 and 4.6 respectively. The results for the ORL dataset for 2 and 5 training images are given in Figures 4.7 and 4.8. For comparison, the same tests were run, but instead of finding the best weights based on cross validation, the weights that gave the best results on the testing images was used. 
The results for the Yale dataset with two and five training images are given in Figures 4.9 and 4.10 respectively and The results for the ORL dataset with two and five training images are given in Figures 4.11 and 4.12 respectively. More tests were done to compare the results of Weighted Regional Voting with 44 Figure 4.5: Comparing various weighting schemes in different sized regions on Yale with 2 training images and SLPP as the regional classifier Equal Direct • Exponential • Polynomial 10 15 Divisions Figure 4.6: Comparing various weighting schemes in different sized regions on Yale with 5 training images and SLPP as the regional classifier Equal Direct • Exponential Polynomial 10 15 Divisions 45 Figure 4.7: Comparing various weighting schemes in different sized regionson ORL with 2 training images and SLPP as the regional classifier Equal 92 Direct - Exponential Polynomial oeg, S-l o