UTILIZING MULTI-TASK CASCADED CONVOLUTIONAL NETWORKS AND RESNET-50 FOR FACE IDENTIFICATION TASKS

by Chuanyang Cai
B.Sc., The University of British Columbia, 2018

PROJECT SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE

UNIVERSITY OF NORTHERN BRITISH COLUMBIA

December, 2021

© Chuanyang Cai, 2021

Abstract

In recent years, computer-animated characters have been designed to be increasingly vivid and lifelike, and many of them look extremely similar to real people. Because of this high degree of similarity, some classical face recognition models become confused during the face identification process; for example, the FaceNet model matches cartoon facial images with similar real faces. As a result, some people may try to cheat face recognition systems by presenting virtual faces. To address this problem, this paper proposes an integrated approach that utilizes Multi-task Cascaded Convolutional Networks (MTCNN) and the ResNet-50 model to classify the real and cartoon (or virtual) faces of an input image before the face identification task. Our experiments show that the proposed integrated approach achieves better results on face identification tasks than some classical face recognition models that perform the tasks directly.

Contents

Abstract
List of Figures
Acknowledgments
Chapter 1: Introduction
  1.1: Overview
    1.1.1: Digital Image Processing
    1.1.2: Convolutional Neural Network (CNN)
  1.2: The Inspiration and Motivation
    1.2.1: Computer Animation and Object Detection
    1.2.2: Computer Animation and Face Identification
  1.3: Research Contribution
  1.4: An Overview of the Rest of This Project
Chapter 2: Literature Review
  2.1: Multi-task Cascaded Convolutional Networks and Face Detection
  2.2: Residual Network and Classification
  2.3: FaceNet and Face Identification
Chapter 3: Methodology and Experiments
  3.1: Collect Facial Images and Set Up Datasets
  3.2: Create the Cartoon and Real Faces Classifier
    3.2.1: Cross Entropy Loss
  3.3: Set Up the Database for the Face Identification Tasks
Chapter 4: Research Contributions and Potential Applications
  4.1: The Classification of Images
  4.2: The Photo Transmission Between Smartphone and Laptop
Chapter 5: Conclusion and Future Work
  5.1: Conclusion
  5.2: Future Improvements
References

List of Figures

Figure 1: Figure 1(a) is a grayscale image of the number eight, and 1(b) shows the specific brightness value of each pixel [18].
Figure 2: This diagram illustrates a typical architecture of a convolutional neural network [19].
Figure 3: The object detection results of applying the SSD model (the SSD model captures both real and cartoon people in an image).
Figure 4: In this figure, the image on the left is a small dataset that is used to save some registered faces. The image on the right is the result of cartoon face identification by applying the FaceNet model.
Figure 5: The pipeline of Multi-task Cascaded Convolutional Networks [6].
Figure 6: The two residual blocks of the ResNet architecture [5].
Figure 7: The model structure of FaceNet [15].
Figure 8: The triplet loss optimization of the FaceNet model [15].
Figure 9: In Figure 9(a), we only use the MTCNN model to detect faces, and this model captures both real and cartoon faces. In Figure 9(b), we integrate the ResNet-50 model into the MTCNN model as a classifier so that together they can classify the real and cartoon faces in an image.
Figure 10: A database user interface created with Qt Designer; all images are saved in SQLite, a database connected to PyCharm.
Figure 11: Image (a) shows a cartoon face identification result obtained with the integrated method, and image (b) shows a real face identification result.
Figure 12: The flow diagram of the face recognition process when applying our proposed method.
Figure 13: The process of image transmission when applying our proposed method.
Figure 14: The generated result of using the proposed method to test a monkey facial image.
Figure 15: The generated result of using the proposed method to test a scenery picture.

Acknowledgments

I would like to acknowledge and give my warmest thanks to my supervisor, Professor Chen, who made this work possible. His guidance and advice carried me through all stages of writing the project. I would also like to thank the members of my committee, Professor Jiang and Professor Sui, for their excellent comments and suggestions.

Chapter 1 Introduction

1.1: Overview

In the field of computer vision, face recognition technology is one of the popular applications that has been studied for more than 50 years [1]. Over the past few decades, scientists have continued to improve face recognition techniques. The American researcher Bledsoe first used a semi-automatic method for face recognition. Later, some scientists tried to use support vector machines (SVMs) or random forests to improve the efficiency and accuracy of face recognition [2].
Currently, the combination of deep learning methods and digital image processing techniques has largely replaced earlier methods, because it achieves higher accuracy on face recognition tasks [1]. One of the most popular techniques in the field of deep learning is the convolutional neural network (CNN), which is the main method applied in this project.

Recently, face recognition technology has been widely used around the world to solve different problems. A typical example is surveillance systems, where this technology can help people automatically locate a specific individual's face in a photo or video, which is more efficient and accurate than a manual search. Before starting this project, it is necessary to define face detection and face recognition; in fact, these two concepts are different. The former is always the first step of face recognition and is mainly used to find every face in an image. The algorithms for these two tasks are also different. For example, in this project we use Multi-task Cascaded Convolutional Networks (MTCNN) for face detection tasks, while for face recognition tasks we mainly utilize the FaceNet algorithm. To make the whole process clearer, we summarize the face recognition technique in the next paragraph.

Generally speaking, the face recognition technique can be divided into three stages. 1) The first stage is known as face detection and is used to locate each face in the image. In this stage, the detector marks each face with a bounding box and returns the coordinates of each face. 2) In the second stage, after obtaining the position of the face, we must extract facial features from the face and then convert them into a feature vector using image processing techniques. This stage is generally known as facial feature extraction. 3) In the third stage, the facial feature vector computed in the second stage is compared with other feature vectors. Each comparison yields a similarity score that expresses the probability that the two vectors belong to the same face. Therefore, the third stage is also known as face matching or identification.

1.1.1: Digital Image Processing

Before introducing convolutional neural networks, it is necessary to know some background about digital image processing techniques. Digital image processing is a method used to represent an image as a two-dimensional function $f(x, y)$, where $x$ and $y$ refer to the plane coordinates [3]. Within the coordinate system, the value of $f$ at any pair of coordinates $(x, y)$ is called the gray level of the image at that point. Each element has a specific location and a gray level, and these elements are called pixels. The value of each pixel typically ranges from 0 to 255 (i.e., black to white). When an image contains a finite number of pixels, it is called a digital image [3].

Figure 1: Figure 1(a) is a grayscale image of the number eight, and 1(b) shows the specific brightness value of each pixel [18].

For example, Figure 1(a) represents a grayscale image of the digit 8. A grayscale image is an image that contains only luminance information [18]. Generally, in a grayscale image, the bright parts (i.e., values close to 255) are white, while the dark parts (i.e., values close to 0) are black. In Figure 1(a), the digit 8 is white, while the background of the image is black. In addition, the many small white and black squares in this image are the pixels. In Figure 1(b), we can count 24 pixels along the height ($y$) of the image and 16 pixels along its width ($x$). Therefore, this image can be represented as a two-dimensional function defined on a 16 × 24 grid of pixels. Each pixel has a value between 0 and 255: the white pixels are close to 255, while the black pixels are close to 0.
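To make this concrete, the short Python sketch below loads a grayscale image as exactly such a two-dimensional array of gray levels. It is a minimal sketch assuming the Pillow and NumPy libraries; the file name is hypothetical.

```python
import numpy as np
from PIL import Image

# Load an image and convert it to grayscale ("L" mode keeps a single
# luminance channel, so every pixel is one value in the range 0-255).
img = Image.open("eight.png").convert("L")  # hypothetical file name

# The digital image is now the two-dimensional function f(x, y),
# stored as a NumPy array indexed by row (y) and column (x).
f = np.asarray(img)

height, width = f.shape
print(f"width = {width}, height = {height}")   # e.g., 16 x 24 as in Figure 1(b)
print(f"gray level at (x=0, y=0): {f[0, 0]}")  # 0 = black, 255 = white
```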
By using a computer and some algorithms, the obtained digital image can be processed. In the field of computer vision, one of the typical algorithms is the convolutional neural network, which is described in the next section.

1.1.2: Convolutional Neural Network (CNN)

The concept of the convolutional neural network was first introduced by the computer scientist Yann LeCun in the 1980s [4]. At that time, an early version of the CNN, named LeNet, was used to recognize handwritten digits [4]. However, CNN technology was not widely used, due to the limitations of the computing hardware available for training network models. It was not until 2012 that Alex Krizhevsky used graphics processing units (GPUs) to train his model and designed AlexNet, which achieved a top-five error of 15.3% in the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) [4]. Since then, more and more scientists have revisited CNNs and have improved them significantly over the past decade.

With the rapid development of deep learning, convolutional neural networks have become the main method for face recognition. To achieve better performance, computer scientists have proposed a variety of convolutional neural network architectures. For example, a Facebook research group created a famous CNN architecture, DeepFace, for recognizing human faces in digital images [13]. The researchers used more than 4.4 million facial images to train their CNN model, and the resulting DeepFace model reached a face recognition accuracy of about 97.35% [13]. Another famous face recognition model, Deep Identification-Verification Features (DeepID2), was built by a research group at the Chinese University of Hong Kong. This model raised face recognition accuracy to a higher value of about 99.15% [14].

Figure 2: This diagram illustrates a typical architecture of a convolutional neural network [19].

In general, the architecture of a CNN model is divided into two parts, as shown in Figure 2. The first part is the feature extraction layer, and the other is the classification layer. 1) In the feature extractor, each extraction layer receives the output of its immediate predecessor as input and passes its own output as input to the next layer. 2) The classification layer, also known as the fully connected layer, computes the score for each category from the extracted features. In this layer, a backward propagation method computes the gradients of the objective function, and stochastic gradient descent (SGD) uses them to update the weights. Both methods are executed iteratively until the loss is minimized and the final outputs are obtained.

In addition to the two components mentioned above, another essential component of the CNN model, namely the activation function, is also included in the network. One of the most important reasons for including an activation function in a CNN is that it makes the network non-linear. In other words, the activation function allows the CNN to fit non-linear models more easily. In the field of deep learning, several activation functions are commonly used. Among them, the ReLU function is often used to mitigate the vanishing gradient problem. For binary classification tasks, we usually use the Sigmoid function, and for multi-class classification tasks, the Softmax function is preferred.
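As an illustration, the following PyTorch sketch shows this two-part structure: a small convolutional feature extractor with ReLU activations, a fully connected classification layer, and an SGD optimizer. The layer sizes are illustrative assumptions, not the networks used in this project.

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        # Feature extraction layers: each layer consumes the output of
        # the previous one; ReLU supplies the non-linearity.
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Classification (fully connected) layer: maps the extracted
        # features to one score per category.
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)

    def forward(self, x):                     # x: (batch, 3, 64, 64)
        h = self.features(x)
        return self.classifier(h.flatten(1))  # raw class scores (logits)

model = SimpleCNN()
scores = model(torch.randn(1, 3, 64, 64))                 # one forward pass
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # SGD, as in the text
loss_fn = nn.CrossEntropyLoss()  # applies Softmax + cross entropy internally
```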
1.2: The Inspiration and Motivation

The previous section presented some basic knowledge about digital image processing and CNN architectures. In this section, we briefly describe the inspiration and motivation behind this project.

1.2.1: Computer Animation and Object Detection

Today, watching computer animation is a popular hobby for teenagers and adults around the world. One of the major reasons is that cartoon characters created by computers are becoming more and more vivid and realistic. For example, on the left side of Figure 3, two cartoon characters look very similar to real people. Due to this high similarity, some object detection CNN models, such as the Single Shot Multi-Box Detector (SSD), also capture cartoon characters when they are used to detect real people (as shown in Figure 3). Therefore, we decided to give such a model a new capability: classifying real people and cartoon characters while detecting them. However, because collecting full-body images of people is difficult, we decided to focus on collecting facial images instead.

Figure 3: The object detection results of applying the SSD model (the SSD model captures both real and cartoon people in an image).

1.2.2: Computer Animation and Face Identification

In recent years, the accuracy of face recognition technology has improved dramatically; in 2020, the best face recognition system had an error rate of 0.08% [7]. However, we find that some classical face identification models may match cartoon faces with similar real faces. For example, in Figure 4, we use the FaceNet model for an identification task. In this example, we choose a famous Chinese actor named Zhan Xiao and his cartoon facial image. Based on the matching name (shown at the bottom of the blue bounding box), we can see that the FaceNet model matches Zhan Xiao's cartoon face (i.e., the image on the right) with his real face (i.e., the 51st image in the database on the left). This result demonstrates that the FaceNet model can match a cartoon face with the similar face of a real person. From this experiment, it can be seen that cartoon or virtual faces can negatively affect the accuracy of face recognition tasks. Therefore, we decided to propose an integrated approach to deal with this situation.

Figure 4: In this figure, the image on the left is a small dataset that is used to save some registered faces. The image on the right is the result of cartoon face identification by applying the FaceNet model.

In this paper, we mainly focus on deep learning methods for face detection and identification tasks. In the field of deep learning, a variety of CNN models are used to solve these two problems.
In the literature review section, we introduce one of the classical face detection models, Multi-task Cascaded Convolutional Networks (MTCNN), proposed by Kaipeng Zhang and his team in 2016 [6]. To further improve its performance, we add another popular model, ResNet-50, as a classifier. The final results show that the integration of the two network models can classify the real and cartoon faces in input images. For face identification tasks, we choose the FaceNet model to match faces; it is also introduced in the literature review section.

1.3: Research Contribution

The main contribution of this project is the adoption of an integrated method to improve the accuracy of face recognition models in a special case: detecting and identifying real and cartoon faces. According to our experiments, during the face identification process some classical face recognition models cannot avoid matching a vivid cartoon or virtually designed face with the similar face of a real person (shown in Figure 4). To deal with this situation, we use the proposed integrated approach to examine the input facial images and filter out every cartoon or virtually designed face before performing the face identification tasks. Figure 11 illustrates that better results can be obtained on face identification tasks when using our proposed approach.

1.4: An Overview of the Rest of This Project

Chapter 2 briefly discusses several important models and theories relevant to this project: a face detection model named Multi-task Cascaded Convolutional Networks, a classification model named ResNet-50, and a face identification model named FaceNet. Chapter 3 describes our proposed integrated approach and shows some relevant experimental results. Chapter 4 discusses the main contribution of this project and two potential applications based on our proposed integrated approach. Chapter 5 summarizes this project and outlines several future improvements.

Chapter 2 Literature Review

2.1: Multi-task Cascaded Convolutional Networks and Face Detection

In [6], Kaipeng Zhang and his team proposed a new framework of unified cascaded CNNs trained by multi-task learning (MTCNN), which integrates face alignment and face detection for better detection performance. In addition, they proposed an efficient automatic online hard sample mining strategy to replace manual sample selection in the learning process [6].

Figure 5: The pipeline of Multi-task Cascaded Convolutional Networks [6].

The proposed model consists of three networks. Before an image is input to the three-stage cascade, it is resized to different scales to create an image pyramid (i.e., the first step in Figure 5). The resize factor generally ranges from 0.70 to 0.80; values in this range avoid missing small faces during detection. The image pyramid is then input to the first network, the Proposal Network (P-Net). As a fully convolutional network, P-Net is used to obtain candidate windows and their bounding box regression vectors. After obtaining these candidate windows, non-maximum suppression (NMS) is used to reduce the number of overlapping candidates. The remaining candidates are then input to the second network, the Refine Network (R-Net), which filters out false candidates from the first stage. In this stage, NMS is again used to merge highly overlapping candidates. In the third stage, the processed candidates are input to the Output Network (O-Net). Although this network is similar to the second one, it focuses on obtaining more facial detail; finally, the positions of five facial landmarks are output from the network.
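As a usage illustration, the sketch below runs this cascade through the facenet-pytorch implementation of MTCNN. This is one of several available implementations, not the authors' original code; the image path is hypothetical.

```python
from PIL import Image
from facenet_pytorch import MTCNN  # pip install facenet-pytorch

# factor is the image-pyramid resize factor discussed above; 0.709 is
# this implementation's default and falls in the 0.70-0.80 range.
mtcnn = MTCNN(keep_all=True, factor=0.709)  # keep_all: return every face

img = Image.open("group_photo.jpg")  # hypothetical test image

# detect() runs the P-Net -> R-Net -> O-Net cascade internally and
# returns a bounding box, a confidence score, and five facial
# landmarks for each detected face.
boxes, probs, landmarks = mtcnn.detect(img, landmarks=True)

if boxes is not None:  # detect() returns None when no face is found
    for box, prob in zip(boxes, probs):
        x1, y1, x2, y2 = box  # corner coordinates of the bounding box
        print(f"face at ({x1:.0f}, {y1:.0f})-({x2:.0f}, {y2:.0f}), p = {prob:.2f}")
```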
Because the MTCNN model consists of three independent networks, the training process is also slightly different. First, the data are used to train the P-Net. The trained P-Net's outputs are then used to train the R-Net, and finally the R-Net's outputs are used to train the O-Net. In short, the training output of each network serves as the training input of the following network. Each network has three tasks, i.e., face classification, bounding box regression, and facial landmark localization, and the authors use a different loss function to optimize each task.

1) In the face classification task, they use the cross-entropy loss for each sample $x_i$:

$L_i^{det} = -\left( y_i^{det} \log(p_i) + (1 - y_i^{det}) \log(1 - p_i) \right)$,

where $p_i$ is the probability, produced by the network, that sample $x_i$ is a face, and $y_i^{det} \in \{0, 1\}$ denotes the ground-truth label.

2) In the bounding box regression task, they use the Euclidean loss for each sample $x_i$:

$L_i^{box} = \left\| \hat{y}_i^{box} - y_i^{box} \right\|_2^2$,

where $\hat{y}_i^{box}$ is the regression target obtained from the network and $y_i^{box}$ denotes the ground-truth coordinates.

3) In the facial landmark localization task, they also use the Euclidean loss for each sample $x_i$:

$L_i^{landmark} = \left\| \hat{y}_i^{landmark} - y_i^{landmark} \right\|_2^2$,

where $\hat{y}_i^{landmark}$ is the facial landmark coordinate obtained from the network and $y_i^{landmark}$ represents the ground-truth coordinate.

2.2: Residual Network and Classification

In the deep learning field, the depth of a neural network has a significant influence on the accuracy of the final outputs. With the development of computational hardware, many researchers have tried to deepen their network architectures. However, they also noticed that once the depth of a network passes a certain threshold, its output accuracy decreases. At first, some scientists believed that this decrease in accuracy was caused by vanishing gradients appearing once the network reaches a certain depth. In 2015, in [5], Kaiming He and his team argued that vanishing gradients were not the key factor behind the accuracy drop in deeper networks, because that problem had been largely addressed by normalized initialization [8][9][10][11] and intermediate normalization layers [12]. Based on their experiments, they attributed the worse performance of deeper networks to degradation (i.e., deeper neural networks are more difficult to train). To solve this problem, they proposed a residual learning framework. There are several residual network architectures; in this paper, we focus only on the 50-layer residual network (ResNet-50).

Figure 6: The two residual blocks of the ResNet architecture [5].

In [5], the authors introduce two residual blocks (shown in Figure 6) for networks of different depths. The residual block in Figure 6(a) is generally used in the shallower 18-layer and 34-layer networks. The residual block in Figure 6(b) is used in the deeper networks (i.e., 50, 101, and 152 layers), and it is the key building block of the ResNet-50 model. In general, for a stack of layers with input $x$, the mapping computed by the stack is denoted $H(x)$. In the residual network, the authors propose dividing the processing of the input into two parts. In Figure 6(b), the input of the block is $x$, and two mappings are used to produce its output: one, denoted $F(x)$, represents the non-linear mapping, and the other is the identity mapping $x$. The results of the non-linear and identity mappings are then added to recast the original mapping as $H(x) = F(x) + x$. Equivalently, the non-linear mapping, called the residual mapping, can be written as $F(x) = H(x) - x$. According to their experiments, the residual mapping is easier to optimize than the original mapping. The authors evaluate the residual network on the ImageNet dataset, and an ensemble of these networks achieves a 3.57% error rate on the ImageNet test set, which was the lowest in the ILSVRC 2015 classification task.
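The PyTorch sketch below illustrates the bottleneck residual block of Figure 6(b) as just described. The channel sizes follow the first ResNet-50 stage; stride and downsampling details are omitted, so this is a simplified sketch rather than the full architecture.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    def __init__(self, in_ch=256, mid_ch=64):
        super().__init__()
        # F(x): the residual (non-linear) mapping, a 1x1 -> 3x3 -> 1x1
        # stack that squeezes and then restores the channel count.
        self.residual = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1), nn.BatchNorm2d(mid_ch), nn.ReLU(),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.BatchNorm2d(mid_ch), nn.ReLU(),
            nn.Conv2d(mid_ch, in_ch, 1), nn.BatchNorm2d(in_ch),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        # H(x) = F(x) + x: add the identity shortcut to the residual.
        return self.relu(self.residual(x) + x)

block = Bottleneck()
y = block(torch.randn(1, 256, 56, 56))  # shape preserved: (1, 256, 56, 56)
```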
2.3: FaceNet and Face Identification

In [15], Google proposed a new face recognition model, FaceNet, to improve efficiency when implementing face identification at scale. The FaceNet model uses the Inception network, a popular convolutional neural network architecture in computer vision, to process the input images [16]. Each facial image is mapped to a 128-dimensional feature vector by the Inception network. The authors then apply normalization, embedding each image into a 128-dimensional Euclidean space; the embedding of an image $x$ is denoted $f(x)$. The overall process is shown in Figure 7.

Figure 7: The model structure of FaceNet [15].

After obtaining the embedding $f(x)$, the main task is to optimize it. The authors use the triplet loss to minimize the distance between facial vectors of the same identity. To understand the triplet loss, they define three vectors in the Euclidean space, as shown in Figure 8: the anchor and positive vectors have the same identity, and the negative vector has a different identity. After learning, the squared distance between the anchor and positive vectors should be small, while that between the anchor and negative vectors should be large.

Figure 8: The triplet loss optimization of the FaceNet model [15].

In [15], the authors also present the calculation of the triplet loss. They define the anchor image as $x_i^a$, the positive image as $x_i^p$, and the negative image as $x_i^n$. Figure 8 then translates into the following constraint:

$\left\| f(x_i^a) - f(x_i^p) \right\|_2^2 + \alpha < \left\| f(x_i^a) - f(x_i^n) \right\|_2^2$,

where $\alpha$ is a margin that is enforced between positive and negative pairs. If there are $N$ training triplets in the dataset, the overall triplet loss $L$ is the summation of each violation of this constraint:

$L = \sum_{i=1}^{N} \left[ \left\| f(x_i^a) - f(x_i^p) \right\|_2^2 - \left\| f(x_i^a) - f(x_i^n) \right\|_2^2 + \alpha \right]_+$.

Finally, the authors evaluate the FaceNet model on the Labeled Faces in the Wild (LFW) dataset, where their system achieves the highest face recognition accuracy reported at the time (99.63%).
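A minimal PyTorch sketch of this loss is shown below, assuming each row of f_a, f_p, and f_n is an L2-normalized 128-dimensional embedding of an anchor, positive, or negative image; the margin value 0.2 is illustrative.

```python
import torch
import torch.nn.functional as F

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    pos_dist = (f_a - f_p).pow(2).sum(dim=1)  # ||f(x_a) - f(x_p)||^2
    neg_dist = (f_a - f_n).pow(2).sum(dim=1)  # ||f(x_a) - f(x_n)||^2
    # Hinge [.]_+: only triplets that violate the margin contribute.
    return F.relu(pos_dist - neg_dist + alpha).sum()

# Example with random embeddings, normalized onto the unit hypersphere.
f_a, f_p, f_n = (F.normalize(torch.randn(8, 128), dim=1) for _ in range(3))
print(triplet_loss(f_a, f_p, f_n))
```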
Chapter 3 Methodology and Experiments

In our project, we integrate the three models to accomplish face recognition. The only difference from the traditional face recognition process is that we filter out cartoon facial images before they enter the face identification stage. According to our experiments, our integrated method exhibits better performance on the identification of cartoon facial images. The overall process can be divided into three stages: collecting data, creating a classifier, and setting up a small database.

3.1: Collect Facial Images and Set Up Datasets

In the first stage, we focus on collecting virtual (or cartoon) and real facial images from websites to create our datasets. Because a variety of factors (such as facial expression, position, orientation, and so on) influence the accuracy of face detection, we only collect frontal faces in this project. Moreover, some cartoon faces cannot be detected by the MTCNN model, which is not sensitive to 2-dimensional animated faces; therefore, we tend to choose 3-dimensional cartoon faces to build our datasets. To remove irrelevant factors (e.g., background colors, other objects, and so on) from the images, we use the MTCNN model to crop each image so that only the face remains. The cropped images are divided and stored in two folders (i.e., train and test) as our datasets. We also create cartoon and real subfolders for the training dataset (and likewise for the test dataset) to save the collected images. In this project, we manually collected 500 cartoon facial images and 500 real facial images. For the 500 cartoon faces (and likewise for the 500 real faces), we randomly chose 450 images for the training dataset and 50 images for the test dataset.

3.2: Create the Cartoon and Real Faces Classifier

In the second stage, the main task is to use the datasets to train a model that can classify the real and cartoon faces in an image. In general, obtaining a well-performing convolutional neural network requires a large amount of data (far more than 1,000 images) and a long training time. To accelerate training and improve classification accuracy, we decided to use transfer learning, a popular technique in the deep learning field. In addition, to increase the diversity of our training datasets, we apply data augmentation to each image. For example, we use random brightness, random horizontal flips, and other methods to process each input image; these methods make small changes to each original image so that the CNN treats it as a distinct image. In this project, for our binary classification task (i.e., cartoon and real faces), we downloaded the ResNet-50 pre-trained model from the official website as our base model (i.e., transfer learning). Before training, we also adjust the final outputs of the ResNet-50 model to two categorical outputs. After that, we use the datasets built in the first stage to train the pre-trained model. Once training is complete, the model is ready to classify the real and cartoon faces in an image. Figure 9(b) shows the performance of our trained model, in which a red bounding box marks a cartoon face and a green bounding box marks a real face. Moreover, on top of each bounding box, we record the probability of the face being cartoon or real. For example, on top of the red bounding box, we can see $P(\text{cartoon}) = 99\%$, indicating that this facial image has a 99 percent probability of being a cartoon face.

Figure 9: In Figure 9(a), we only use the MTCNN model to detect faces, and this model captures both real and cartoon faces. In Figure 9(b), we integrate the ResNet-50 model into the MTCNN model as a classifier so that together they can classify the real and cartoon faces in an image.
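The sketch below shows one way to set this up with PyTorch and torchvision; the augmentation parameters and folder path are illustrative assumptions, not the exact values used in this project.

```python
import torch.nn as nn
from torchvision import models, transforms
from torchvision.datasets import ImageFolder

# Transfer learning: start from a ResNet-50 pre-trained on ImageNet
# and replace its final layer with a two-way (cartoon vs. real) output.
model = models.resnet50(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 2)

# Data augmentation: small random changes make each cropped face look
# like a distinct training image to the network.
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),       # random horizontal flip
    transforms.ColorJitter(brightness=0.2),  # random brightness
    transforms.ToTensor(),
])

# ImageFolder picks up the cartoon/ and real/ subfolders as the two
# class labels; the path stands in for our dataset layout.
train_set = ImageFolder("datasets/train", transform=train_transform)
```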
3.2.1: Cross Entropy Loss

During training, the cross entropy loss function is used to evaluate our model; it is introduced in this section. In this project, we only have to classify real and cartoon faces, so our situation is a binary classification task. Assuming that the probability of a face being real is $p$, the probability of it being a cartoon face is $1 - p$. The cross entropy loss in the binary case is expressed as follows [17]:

$L = \sum_i L_i = \sum_i -\left[ y_i \cdot \log(p_i) + (1 - y_i) \cdot \log(1 - p_i) \right]$,

where $i$ indexes the input images, $y_i$ is the label ($y_i = 1$ means that the input image is a real face, while $y_i = 0$ means that it is a cartoon face), and $p_i$ denotes the predicted probability that input image $i$ is a real face.
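As a quick numeric check of this formula, the sketch below evaluates it on four hypothetical predictions and compares the result with PyTorch's built-in binary cross entropy.

```python
import torch
import torch.nn.functional as F

p = torch.tensor([0.9, 0.2, 0.7, 0.1])  # predicted P(real face)
y = torch.tensor([1.0, 0.0, 1.0, 0.0])  # labels: 1 = real, 0 = cartoon

# Direct translation of L = sum_i -[y_i*log(p_i) + (1-y_i)*log(1-p_i)].
loss = -(y * p.log() + (1 - y) * (1 - p).log()).sum()

print(loss)
print(F.binary_cross_entropy(p, y, reduction="sum"))  # same value
```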
3.3: Set Up the Database for the Face Identification Tasks

In this stage, the first task is to set up a small database for storing real facial images. In this project, we use Qt Designer and SQLite to set up the user interface, as shown in Figure 10. We have saved 100 real facial images in the database. In the figure, each image can be viewed in the second column of the user interface, and the third column records the name of each image.

Figure 10: A database user interface created with Qt Designer; all images are saved in SQLite, a database connected to PyCharm.

After setting up the small database, we can use the FaceNet model for face identification tasks. The image to be identified should be a new cartoon or real facial image that is not contained in the database. Previously, the matching result in Figure 4 demonstrated that the FaceNet model can match a cartoon facial image with the similar face of a real person. To avoid this situation, we use the integrated method to filter out cartoon facial images before face identification. The result of applying our proposed method is shown in Figure 11(a), which demonstrates that cartoon images are not allowed to be used for face identification. In addition, the system pops up a notification window telling the operator that the given image is a cartoon facial image. In Figure 11(b), if we choose a real facial image for identification, our proposed method shows the matching rate and the matched name at the bottom of the bounding box. In this example, the given image is matched with a registered face named Qing Yan at a matching rate of 81%.

Figure 11: Image (a) shows a cartoon face identification result obtained with the integrated method, and image (b) shows a real face identification result.

Chapter 4 Research Contributions and Potential Applications

As mentioned at the end of Chapter 1, the main research contribution of this project is an integrated method applicable to face recognition systems. In this integrated method, we add ResNet-50 as a classifier to the traditional face detection algorithm (i.e., the MTCNN model), making it able to classify the real and cartoon faces in input images. With real and cartoon faces classified, the face identification model achieves better performance in the face recognition system. The next paragraph summarizes the entire process of our integrated method.

Since there are many cartoon facial images on the web that resemble real faces, some face recognition systems may match a cartoon facial image with the similar face of a real person. In this case, some people may use crafted cartoon facial images to cheat face recognition systems, so such images can negatively influence the accuracy of face identification tasks. In view of this, our proposed integrated method can classify the real and cartoon faces before the face identification tasks begin. Figure 12 is a flow diagram summarizing our proposed face recognition system. In this system, each input image is first checked by a classifier (i.e., our proposed method). If the image is of the cartoon type, it is not allowed to enter the face identification task; if it is of the real type, it is sent on to the face identification step. In this improved face recognition system, our proposed method first filters the cartoon or virtual facial images out of the inputs, so that no cartoon images influence the face identification process and the whole system produces more precise results.

Figure 12: The flow diagram of the face recognition process when applying our proposed method.
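The sketch below mirrors this flow in Python glue code. The callables mtcnn (crops a face tensor from an image, or returns None), classifier (the fine-tuned ResNet-50), and facenet (maps a face to a 128-dimensional embedding), as well as the database of (name, embedding) pairs, stand in for the components built in Chapter 3; all of these names and the distance threshold are assumptions of this illustration, not fixed APIs.

```python
import torch

def identify(img, mtcnn, classifier, facenet, database, max_dist=1.0):
    face = mtcnn(img)          # detect and crop the face
    if face is None:
        return "no face detected"
    batch = face.unsqueeze(0)  # add a batch dimension

    # Stage 1: reject cartoon faces before any identification is run.
    probs = torch.softmax(classifier(batch), dim=1)
    if probs[0, 0] > probs[0, 1]:  # assumes class 0 = cartoon
        return "cartoon face: not allowed for identification"

    # Stage 2: match the real face against the registered database by
    # the smallest squared Euclidean distance between embeddings.
    emb = facenet(batch)
    name, dist = min(
        ((n, (emb - e).pow(2).sum().item()) for n, e in database),
        key=lambda t: t[1],
    )
    return name if dist < max_dist else "no match in database"
```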
Moreover, we would like to introduce two potential applications of our integrated method. The following sections give more concrete descriptions.

4.1: The Classification of Images

The first potential application is using our proposed method for automatic image classification. Today, many teenagers are fond of collecting different kinds of images on their laptops. These images may contain cartoon characters' screenshots, real personal selfies, landscape images, and so on. Consequently, it is time-consuming for them to sort these images into different folders manually. To solve this problem, they can use our proposed method to automatically create a folder that contains only real people's selfies or only cartoon characters' screenshots. This potential application can therefore help people manage and browse the images on their laptops more easily.

4.2: The Photo Transmission Between Smartphone and Laptop

Another potential application is using the integrated method as a mobile phone album classifier that helps filter photos. Today, many teenagers and adults enjoy watching animations, and many of them are used to collecting screenshots of animated characters on their smartphones. Their smartphone albums therefore contain photos of both real people and animated characters. If they want to import only a particular type of photo (for example, only real people's photos) from their smartphones to their laptops, they may have to filter out all the cartoon images manually, which is quite time-consuming. In this case, our proposed method can be applied to the image transmission process, thereby helping people filter out all cartoon photos automatically. Figure 13 summarizes the above description to make it more concrete.

Figure 13: The process of image transmission when applying our proposed method.

Chapter 5 Conclusion and Future Work

5.1: Conclusion

Recently, the face recognition system has become one of the most useful technologies in the world. It can not only help people automate some tedious tasks but also achieve more accurate results on face recognition tasks that involve large amounts of data. In this paper, we also propose two potential applications of our face recognition method. For example, the technique can potentially be used in image management systems to help people classify their images, and in the photo transmission process, people can use the integrated approach to automatically import one type of photo (i.e., cartoon or real) from their smartphones to their laptops.

In general, face recognition technology has been widely used in different fields, but some face recognition models still need continuous improvement. In this project, we proposed an integrated approach to the face recognition task using the MTCNN model and ResNet-50. The experimental results show that our integrated approach is able to classify the real and cartoon faces in images. Moreover, our integrated method outperforms some traditional methods in face recognition systems when both real and cartoon faces are present.

Beyond giving the face recognition model the ability to classify both real and cartoon faces, in this project we also found an interesting condition that should be improved in the future. Specifically, this condition occurred when we tested some animal images with our proposed integrated approach. In Figure 14, the input picture is a facial image of a monkey. However, the experimental result shows that our integrated model assigns a 92% probability that the given image is a cartoon facial image, which is a completely wrong result from our proposed approach.

Figure 14: The generated result of using the proposed method to test a monkey facial image.

We identify two key factors behind this result.

1) The main cause of this result is the use of the MTCNN model for the face detection tasks in this project. Specifically, in the user interface, we added a check for whether a given image contains faces (real or cartoon). If there is no face in the image, the system should pop up a notification window telling the operator that no face can be detected (shown in Figure 15). In that figure, we input a landscape image, which contains no face, to the proposed integrated model; after processing, the system gives the notification seen in Figure 15. Therefore, if we input an image that contains only monkey faces, the output should be the same as the result in Figure 15. However, our experimental result shows that the MTCNN model can detect some monkey faces in images, which should be improved in the future.

Figure 15: The generated result of using the proposed method to test a scenery picture.
2) Another factor that makes our proposed method mistake a monkey face for a cartoon facial image is that our training dataset is not large enough. If we could collect more cartoon facial images, the issue shown in Figure 14 would be much less likely to occur.

5.2: Future Improvements

In this section, we discuss several improvements that may make our proposed integrated approach perform better.

1) The first improvement is to collect more images for the training process. In general, training a CNN model requires a large amount of data, but in this project we only collected about 1,000 images for our dataset. Although we used transfer learning to reduce the impact of the data shortage, our proposed method could achieve more accurate results with a larger amount of training data.

2) Another improvement is to train a new model from scratch, without starting from a pre-trained model. Although this would be very time-consuming, the resulting model might be more flexible in dealing with our problem.

3) According to our experiments, some facial images of monkeys can be detected by the MTCNN model, which should not happen in face recognition systems. To deal with this problem, we should examine the basic architecture of the MTCNN model and its original training and testing datasets; some of them should be improved and modified so that the model obtains better results on face detection tasks.

References

[1] Adjabi, I., Ouahabi, A., Benzaoui, A., & Taleb-Ahmed, A. (2020). Past, present, and future of face recognition: A review. Electronics, 9(8), 1188.
[2] Kremic, E., & Subasi, A. (2016). Performance of random forest and SVM in face recognition. International Arab Journal of Information Technology, 13(2), 287-293.
[3] Gonzalez, R. C., & Woods, R. E. (2008). Digital image processing (Pearson international edition). Pearson.
[4] Alom, M. Z., Taha, T. M., Yakopcic, C., Westberg, S., Sidike, P., Nasrin, M. S., Esesn, B. C., Awwal, A. A., & Asari, V. K. (2018). The history began from AlexNet: A comprehensive survey on deep learning approaches. arXiv preprint arXiv:1803.01164.
[5] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778).
[6] Zhang, K., Zhang, Z., Li, Z., & Qiao, Y. (2016). Joint face detection and alignment using multi-task cascaded convolutional networks. IEEE Signal Processing Letters, 23(10), 1499-1503.
[7] Crumpler, W. (2020). How accurate are facial recognition systems, and why does it matter? Center for Strategic and International Studies, 14.
[8] Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (pp. 249-256). JMLR Workshop and Conference Proceedings.
[9] He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1026-1034).
[10] LeCun, Y. A., Bottou, L., Orr, G. B., & Müller, K. R. (2012). Efficient backprop. In Neural networks: Tricks of the trade (pp. 9-48). Springer, Berlin, Heidelberg.
[11] Saxe, A. M., McClelland, J. L., & Ganguli, S. (2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120.
[12] Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (pp. 448-456). PMLR.
[13] Taigman, Y., Yang, M., Ranzato, M. A., & Wolf, L. (2014). DeepFace: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1701-1708).
[14] Sun, Y. (2015). Deep learning face representation by joint identification-verification. The Chinese University of Hong Kong.
[15] Schroff, F., Kalenichenko, D., & Philbin, J. (2015). FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 815-823).
[16] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1-9).
[17] Saxena, C. (2021, March 3). Binary cross entropy/log loss for binary classification. Analytics Vidhya. Retrieved October 30, 2021, from https://www.analyticsvidhya.com/blog/2021/03/binary-cross-entropy-log-loss-for-binary-classification/
[18] Singh, H. (2021, March 16). How images are stored in the computer? Analytics Vidhya. Retrieved October 30, 2021, from https://www.analyticsvidhya.com/blog/2021/03/grayscale-and-rgb-format-for-storing-images/
[19] Lee, S. Y., Tama, B. A., Moon, S. J., & Lee, S. (2019). Steel surface defect diagnostics using deep convolutional neural network and class activation map. Applied Sciences, 9(24), 5449.