GANS-BASED DATA AUGMENTATION IN CREDIT CARD FRAUD DETECTION
by
Qing Zhao
B.S., Tianjin University of Finance and Economics, 2009
M.S., Tianjin University of Finance and Economics, 2012

PROJECT SUBMITTED IN PARTIAL FULFILLMENT OF
THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE
IN
COMPUTER SCIENCE

UNIVERSITY OF NORTHERN BRITISH COLUMBIA
December 2024
© Qing Zhao, 2024

Abstract
The landscape of credit card fraud is evolving rapidly, with the emergence of
increasingly sophisticated fraudulent methods. This trend has resulted in a significant uptick
in financial losses incurred by both businesses and consumers. To address the challenge of
credit card fraud detection, the industry has widely adopted machine learning models.
However, building effective models is hindered by limited real-world data and the severe
imbalance between fraudulent and legitimate transactions. In this work, I explore the
application of Generative Adversarial Networks (GANs) to synthesizing fraudulent samples
and examine their advantages over traditional data augmentation techniques. To mitigate
potential biases introduced by a single modeling method, Logistic Regression (LR), Random
Forest (RF), and Extreme Gradient Boosting (XGB) are employed, representing different
modeling paradigms. My experiments show that models trained on GAN-based
synthetic data generalize better and discriminate between classes more
effectively.

Table of Contents
Abstract
Table of Contents
List of Tables
List of Figures
Introduction
  Background
  Aims and Objectives
  Paper Structure
Chapter One
  Sampling-Based Researches
  GANs-Based Researches
Chapter Two
  Data Augmentation Techniques
    GANs
    SMOTE
    ADASYN
  Machine Learning Models
    Logistic Regression
    Random Forest
    Extreme Gradient Boosting
  Evaluation Metrics
    Confusion Matrix
    Precision, Recall, and F1-Score
    ROC Curve and AUC
Chapter Three
  Dataset
  GANs Architecture and Hyperparameters
  Experimental Procedure
    Division of Dataset
    Data Augmentation
    Models Construction
Chapter Four
  Validation Results Analysis
  Testing Results Analysis
Chapter Five
  Conclusions
  Further Research
Reference

List of Tables
Table 1: Confusion Matrix
Table 2: The summary of the dataset
Table 3: Feature Importance Values
Table 4: Hyperparameters of the Generator and the Discriminator
Table 5: The outline of training and testing datasets
Table 6: The outline of different datasets
Table 7: Hyperparameters of each classifier
Table 8: Precision of validation datasets
Table 9: Recall of validation datasets
Table 10: F1-Score of validation datasets
Table 11: Precision of the unique test dataset
Table 12: Recall of the unique test dataset
Table 13: F1-Score of the unique test dataset
Table 14: AUC from XGB at five augmentation ratios

List of Figures
Figure 1: GANs Architecture
Figure 2: An example of ROC Curve
Figure 3: Feature Distribution
Figure 4: The losses of the Generator and the Discriminator
Figure 5: Odds in a different group
Figure 6: Correlation Matrix
Figure 7: Validation F1-Score comparison under different odds
Figure 8: Test F1-Score comparison under different odds
Figure 9: Comparative analysis chart of data augmentation ratios
Figure 10: ROC Curves from XGB model

Introduction
Background
With the boom in technological advancements and online transactions, credit card
transactions have become the backbone of modern commerce, offering convenience and
security for both consumers and businesses. At the same time, new credit card
fraud techniques have made it difficult to detect fraud in time, leading to monetary
losses [1]. According to Islam et al. (2023), global losses due to credit card fraud
more than tripled in a decade, from $9.84 billion in 2011 to $32.34 billion in 2021.
A report [3] projects that global losses from credit card fraud will reach $43 billion
by 2026, affecting cardholders, consumers, and merchants worldwide. In 2023 alone, over
1.03 million people were impacted by identity theft.
To address the challenge of credit card fraud detection, the industry has widely
adopted machine learning models trained on historical transaction data. However, building
effective models is hindered by limited real-world data, imbalanced datasets, data drift, and
data overlap [4]. Class imbalance, where one class (fraudulent transactions) is significantly
rarer than the other, is a prevalent issue in credit card fraud detection. Researchers have
primarily focused on two strategies to tackle class imbalance: algorithmic-level and
data-level approaches [5]. Algorithmic-level strategies involve adapting algorithms to handle
imbalanced data, such as cost-sensitive learning [6]. Data-level strategies aim to improve the
distribution of minority class samples through data augmentation, a technique that artificially
expands datasets by generating new training samples from existing data. This research
focuses on data-level strategies, specifically data augmentation.
Data augmentation is crucial for enhancing the reliability and performance of
machine learning models. By increasing the size and diversity of training data, data
augmentation helps mitigate overfitting and improve model generalization. This is
particularly valuable when real-world data is limited or difficult to acquire, as it reduces the
need for time-consuming data collection and labeling processes [1]. Despite the availability
of various sampling techniques for data augmentation, their effectiveness in handling
imbalanced datasets with individual instance significance (one of the major properties in
credit card fraud transactions) is limited [7]. Therefore, Generative Adversarial Networks
(GANs) have drawn growing interest from researchers due to their robustness in
addressing challenges such as data overlap and overfitting, which stems from their ability
to capture intricate data structures [8].
Aims and Objectives
While GANs have demonstrated remarkable success in various domains, their
application in finance, particularly credit card fraud detection, remains relatively unexplored.
This research aims to evaluate the advantages of GAN-based data augmentation in this
context. To comprehensively assess the effectiveness of GANs, the author compares their
performance with that of two traditional sampling techniques, Synthetic Minority Over-sampling
TEchnique (SMOTE) [9] and Adaptive Synthetic Sampling (ADASYN) [10], using a
real-world dataset of European cardholders [11].
The experiments reveal that GANs can effectively generate diverse and realistic
synthetic data, leading to significant improvements in model precision and robustness. To
ensure a thorough evaluation, Logistic Regression, Random Forest, and XGB are employed,
representing different modeling paradigms.
Paper Structure
The remainder of this paper is organized as follows: Chapter One reviews existing
literature on data augmentation techniques. Chapter Two details the research methodologies,
including the experimental setup and evaluation metrics. Chapter Three depicts the
architectural design of this research and describes the implementation procedure thoroughly.
Chapter Four presents the experimental results, comparing the performance of different data
augmentation techniques and machine learning models. Finally, Chapter Five summarizes
the key findings and suggests potential areas for future research.

Chapter One
This section provides a comprehensive review of data augmentation techniques for
addressing class imbalance in credit card fraud detection. Both traditional sampling-based
methods and the more recent, innovative GANs approaches are reviewed.
Sampling-Based Researches
Dornadula et al. (2019) [12] used SMOTE and one-class SVM to handle the
imbalanced dataset, measuring performance with the Matthews correlation
coefficient (MCC). They found that the classifiers performed better after SMOTE was
applied, and that the improvement was most pronounced for logistic
regression classifiers.
Rtayli et al. (2020) [13] proposed a hybrid model that combined recursive
feature elimination, hyperparameter optimization, and SMOTE. After evaluating it on
three large datasets, they concluded that their model performs well regardless of the
dataset used.
Tran et al. (2021) [14] addressed the imbalanced dataset with the SMOTE
and ADASYN techniques. They employed four machine learning algorithms and compared their
performance on the balanced datasets using fundamental, combined, and graphical
assessments. They concluded that, for each ML algorithm, the fraud classification
results obtained with the two data augmentation techniques were nearly identical.
However, the conclusions of [12][13][14] are all based on model
training results. In their experiments, the authors did not split the data into training
and testing datasets. Hence, the implications of these experiments for real-world
scenarios are questionable.
Bagga et al. (2020) [15] used ADASYN to address the class imbalance. Nine classifiers
were trained on the rebalanced dataset. Their experiments demonstrated that even with
ADASYN, the precision and F1-score for fraudulent transactions were still
far lower than those for genuine transactions. However, pipelining and ensemble learning
improved the overall performance.
Yang et al. (2019) [16] proposed a federated fraud detection framework that uses the
SMOTE approach to construct a fraud detection model. The framework
enables different banks to collaboratively learn a shared model while each bank keeps its
skewed training data in its own private database. The experiment showed that
federated learning achieves an average test AUC 10% higher than traditional systems.
Ingole et al. (2021) [17] used the sklearn.utils.resample utility to resample the
fraudulent samples. They explored various sample sizes, finding that the best performance
came when the number of fraudulent samples was increased from 496 to 20,000. However,
finding the oversampling rate was tedious for larger datasets because it required testing
several sample sizes.
GANs-Based Researches
Ngwenduna et al. (2021) [18] provided a thorough introduction to GANs and their
application in data augmentation. They conducted a comparative study on the performances
of WCGAN with other resampling methods on 5 publicly available imbalanced datasets.
With the main evaluation results from AUC and AUPRC, they argued that WCGAN was
significantly superior to SMOTE. Additionally, while SMOTE improved the AUC, it
significantly compromised precision.
Strelcenia et al. (2023) [19] investigated a variety of data augmentation techniques to
address the imbalanced data challenge, including the novel K-CGAN model. Six
classifiers were employed to evaluate the performance of these techniques. K-CGAN,
B-SMOTE, and SMOTE consistently outperformed other methods in terms of precision and
recall, while K-CGAN demonstrated the highest overall F1-Score. Additionally, the authors
visualized the data points generated by K-CGAN to compare them with the original dataset,
demonstrating the superior ability of GANs to synthesize realistic data. However, the
authors’ approach of splitting the testing dataset after augmentation introduced a
confounding factor. Comparing results across varying testing datasets from different
augmentation techniques may not provide a reliable assessment of model performance.
Furthermore, the high proportion of synthetic samples in the testing datasets could limit the
evaluation of the model's generalization ability to real-world data.
Ba, H. (2019) [20] built four types of GANs-based frameworks to augment the
fraudulent samples and compared them with traditional resampling approaches. The
experimental results showed that the GANs-based frameworks produced a better balance
between recall and precision, resulting in a better F1-Score. Additionally, the findings
demonstrated that GAN-based augmentation outperformed the other approaches, as it improved
generalization more notably than other training methods.
Asha et al. (2021) [21] proposed a novel framework that leveraged a Sparse
Autoencoder and GANs to effectively differentiate fraudulent from non-fraudulent credit
card transactions. This model stood out as a one-class classification technique, eliminating
the need for mixed-type datasets containing both positive and negative instances.

Chapter Two
Data Augmentation Techniques
GANs
Generative Adversarial Networks (GANs) were introduced in a landmark
paper by Goodfellow et al. (2014) [8]. A GAN is a deep learning architecture that trains
two neural networks, a Generator and a Discriminator, to compete against each other in
order to generate realistic new data from a given set of training samples. The Generator
network aims to create synthetic data samples: it takes a random noise vector as input
and transforms it into synthetic data. The Discriminator network acts as a critic, trying
to distinguish real data samples drawn from the training set from the samples produced by
the generator.
Figure 1 shows a basic GANs architecture.

Figure 1: GANs Architecture

Training a GANs model involves an iterative competition between the generator and the
discriminator. In the first phase, the generator creates random fake samples, and the
discriminator receives both real data and the generated fake samples and tries to classify
them accurately as either real or fake. In this phase, the discriminator is trained. In the
second phase, the generator updates its weights based on the discriminator's feedback to
improve its ability to generate data that fools the discriminator. During this phase, the
generator is trained and the discriminator is not trainable. By alternating these two
phases, the generator becomes better at creating synthetic data until it is hard for the
discriminator to distinguish real from fake.
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{1}$$

The values x and z are sampled from the real data distribution p_data and the noise
distribution p_z, respectively.
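The two training phases and the value function in Equation (1) can be illustrated with a minimal sketch. It assumes a toy scalar setting that is not the network used in this project: a linear generator G(z) = a * z + b, a logistic discriminator D(x) = sigmoid(w * x + c), and a single real sample and a single noise sample per step. All function and parameter names are hypothetical.

```python
import math


def sigmoid(s):
    """Logistic function, used here as the discriminator's output unit."""
    return 1.0 / (1.0 + math.exp(-s))


def discriminator(x, w, c):
    """Toy discriminator D(x) = sigmoid(w * x + c) on scalar inputs."""
    return sigmoid(w * x + c)


def generator(z, a, b):
    """Toy generator G(z) = a * z + b mapping noise to a scalar sample."""
    return a * z + b


def value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) from lists of discriminator outputs."""
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term


def discriminator_step(x, z, w, c, a, b, lr):
    """Phase 1: gradient ascent on V with respect to (w, c); G is fixed."""
    fake = generator(z, a, b)
    d_real = discriminator(x, w, c)
    d_fake = discriminator(fake, w, c)
    grad_w = (1.0 - d_real) * x - d_fake * fake
    grad_c = (1.0 - d_real) - d_fake
    return w + lr * grad_w, c + lr * grad_c


def generator_step(z, w, c, a, b, lr):
    """Phase 2: gradient descent on V with respect to (a, b); D is frozen."""
    d_fake = discriminator(generator(z, a, b), w, c)
    grad_a = -d_fake * w * z
    grad_b = -d_fake * w
    return a - lr * grad_a, b - lr * grad_b
```

When the discriminator outputs 0.5 for every real and generated sample, that is, when it can no longer tell them apart, `value([0.5], [0.5])` evaluates to -2 log 2, which is the value of Equation (1) at the equilibrium described above.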
Unlike traditional interpolation techniques, GANs generate new samples by learning
the underlying data distribution, thereby enhancing sample diversity. Although GANs are
gaining popularity in many applications, they have notable issues. GANs are notoriously
difficult to train properly and difficult to evaluate, and they suffer from the vanishing
gradient problem, mode collapse, and boundary distortion [18].
GANs have many variants that differ in framework, loss function, and target
application, such as DCGANs [22], which leverage convolutional and deconvolutional layers;
WGANs [23], which use the Wasserstein distance instead of the Jensen-Shannon divergence to
improve training stability; and cGANs [24], which introduce additional information as input
to both the generator and the discriminator.

SMOTE
Synthetic Minority Over-sampling TEchnique (SMOTE) was introduced in 2002 [9]
and has become one of the most commonly used techniques for overcoming class imbalance.
The minority class is over-sampled by taking each minority class sample and introducing
synthetic examples along the line segments joining it to any or all of its k nearest
minority class neighbors. Depending upon the amount of over-sampling required, neighbors
from the k nearest neighbors are randomly chosen.
$$X_{new} = X + \mathrm{rand}(0, 1) \times (X_{old} - X) \tag{2}$$

where X_old is a minority sample in the original dataset, X is one of its k nearest
neighbors, and X_old − X is the line segment joining them. The artificial sample point
X_new is selected at a uniformly random position along this segment.
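The interpolation in Equation (2) can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance for the neighbor search and list-of-floats samples; the function names `k_nearest` and `smote` are hypothetical and do not refer to any library API.

```python
import math
import random


def k_nearest(sample, minority, k):
    """Return the k minority points closest to `sample`, excluding itself."""
    others = [p for p in minority if p is not sample]
    return sorted(others, key=lambda p: math.dist(sample, p))[:k]


def smote(minority, n_new, k, seed=0):
    """Generate n_new synthetic minority points following Equation (2)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x_old = rng.choice(minority)                   # a minority sample
        x = rng.choice(k_nearest(x_old, minority, k))  # one of its neighbors
        r = rng.random()                               # rand(0, 1)
        # X_new = X + r * (X_old - X), applied feature by feature
        synthetic.append([xn + r * (xo - xn) for xo, xn in zip(x_old, x)])
    return synthetic
```

Because every synthetic point is a convex combination of two existing minority samples, it always lies on the segment between them. This is the property that distinguishes interpolation-based methods from GANs, which sample from a learned distribution.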
While SMOTE excels in handling small datasets, its computational demands can be
substantial for larger ones [18]. Moreover, its susceptibility to overfitting and noise can
limit its effectiveness, especially when the minority class is inherently noisy or the
synthetic data fails to accurately represent the underlying distribution. Furthermore,
because SMOTE generates new samples without considering the labels of neighboring
examples, it can produce class overlap, potentially hindering model performance.
ADASYN
Adaptive Synthetic Sampling (ADASYN) is a variant of SMOTE [10]. Unlike
SMOTE, which generates synthetic minority class samples uniformly, ADASYN assigns
weights to minority class instances based on their level of difficulty to learn. Instances that
are closer to the decision boundary will generate more synthetic samples. Meanwhile,
ADASYN is also applicable to the multiple‑class imbalanced learning challenge [19].
However, ADASYN can encounter challenges when dealing with sparsely distributed
minority class instances. If each minority class instance has only a few neighbors within a
specified radius, the generated synthetic samples might be overly concentrated in specific
regions of the feature space, leading to inadequate representation of the underlying
distribution of the minority class. In addition, by concentrating on instances close to the
decision boundary, ADASYN might overrepresent these instances in the synthetic data. This
can lead to a model that is overly sensitive to these borderline cases, potentially reducing the
precision.
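The adaptive weighting can be made concrete with a small sketch. It assumes the usual ADASYN formulation, in which the difficulty of a minority sample is the fraction of majority class points among its k nearest neighbors and the synthetic samples are shared out in proportion to that fraction. The function name and the rounding of the per-sample counts are illustrative choices.

```python
def adasyn_allocation(majority_neighbor_counts, k, n_synthetic):
    """Split n_synthetic across minority samples by learning difficulty.

    majority_neighbor_counts[i] is the number of majority class points
    among the k nearest neighbors of minority sample i.
    """
    ratios = [count / k for count in majority_neighbor_counts]
    total = sum(ratios)
    if total == 0:
        # No minority sample has a majority neighbor: nothing is "hard".
        return [0] * len(ratios)
    return [round(n_synthetic * ratio / total) for ratio in ratios]
```

For example, with k = 5 and three minority samples that have 0, 1, and 4 majority neighbors, ten synthetic samples are allocated as 0, 2, and 8. The sample nearest the decision boundary receives most of them, which is the behavior that can make the resulting model overly sensitive to borderline cases.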
Machine Learning Models
Logistic Regression
Logistic Regression (LR) is a statistical method for predicting the probability of a
binary outcome based on one or more independent variables. Because of its simplicity,
Logistic Regression is one of the most widely used learning algorithms for binary
classification tasks. Predictions are transformed into probabilities using the sigmoid
function [25]. Let x ∈ R^m denote the input feature vector of length m. The response z is
then given by the linear function z = w · x + b, where w is the weight vector and b is the
bias term, both estimated during training. The logistic function is given as

$$g(z) = \frac{1}{1 + e^{-z}}, \quad 0 < g(z) < 1 \tag{3}$$
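As a minimal sketch of Equation (3), the following computes the linear response and maps it to a probability. The function names are hypothetical, and the weights are supplied by hand rather than estimated from data.

```python
import math


def linear_response(x, w, b):
    """Compute z = w . x + b for a feature vector x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def logistic(z):
    """Equation (3): map a real-valued response to the interval (0, 1)."""
    # Note: math.exp overflows for very large negative z; a production
    # implementation would use a numerically stable form.
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, w, b):
    """Predicted probability of the positive (fraud) class."""
    return logistic(linear_response(x, w, b))
```

A response of z = 0 corresponds to a probability of exactly 0.5, which is why 0.5 is the natural default classification threshold.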

The parameters of the logistic regression model are determined by maximum
likelihood. In contrast to the many machine learning methods that are black boxes, Logistic
Regression excels at interpretability. Its coefficients can be
interpreted as log-odds ratios, providing insights into the relationship between features
and the outcome. Moreover, its rapid training and prediction capabilities make it well-suited
for handling large datasets.
Random Forest
Random Forest (RF) is an ensemble learning method introduced in 2001 [26]. It
constructs multiple decision trees and combines their predictions through majority voting for
classification tasks or mean prediction for regression tasks to improve predictive accuracy
and control overfitting. Random Forest introduces randomness at two stages: bootstrap
sampling, where each decision tree is trained on a random subset of the dataset with
replacement, and random feature selection for each split in the decision tree. This technique
makes the decision trees more diverse and helps to reduce correlation between individual
trees. The diversity among the trees reduces its sensitivity to noise and improves its
performance on unseen data, even if there are missing data points. Moreover, Random Forest
handles large and unbalanced datasets effectively [27], and can provide
an estimate of the importance of different features in the dataset.
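The two sources of randomness and the voting step can be sketched as follows. This is an illustration of the mechanics only, with hypothetical helper names; it does not train any trees.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one tree's training set)."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(n_features, n_selected, rng):
    """Pick the candidate feature indices considered at one split."""
    return sorted(rng.sample(range(n_features), n_selected))


def majority_vote(tree_predictions):
    """Combine per-tree class labels into the forest's prediction."""
    return Counter(tree_predictions).most_common(1)[0][0]
```

For instance, `majority_vote([1, 0, 1, 1, 0])` returns 1, since three of the five trees predict the positive class.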
Extreme Gradient Boosting
Extreme Gradient Boosting (XGB) is a scalable and efficient implementation of the
gradient boosting algorithm. This algorithm improves the model sequentially by
focusing on the errors made by previous models, starting with one weak learner and iteratively
adding new weak learners to approximate functional gradients. In each iteration, the error
residuals of the previous model are used to fit the next model. The final ensemble model is
constructed by a weighted summation of all weak learners. Whereas the “bagging” algorithm
picks each data instance with equal probability, the “boosting” algorithm places more
emphasis in each iteration on the instances that the previously trained learners
predicted poorly. “Bagging” mainly reduces variance and overfitting, while
“Boosting” mainly reduces bias and underfitting [25].
The scalability of XGB is attributed to its algorithmic optimizations and system-level
enhancements. Although the weak learners are still added sequentially, XGB parallelizes
the construction of each individual tree, significantly accelerating training. Furthermore,
it incorporates tree pruning and regularization techniques to control model complexity and
prevent overfitting. These combined features make XGB a leading choice for many machine
learning tasks, offering a balance of prediction accuracy and computational efficiency.
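The residual-fitting loop at the core of gradient boosting can be sketched with one-split regression stumps on a single feature. This is plain gradient boosting with squared loss, not XGB itself: it omits regularization, pruning, and the system-level optimizations, and it assumes the feature takes at least two distinct values. All names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals (squared loss)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_rounds, learning_rate):
    """Return the initial constant prediction and the fitted stumps."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # Each new weak learner is fitted to the current residuals.
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        predictions = [
            p + learning_rate * (left_mean if x <= threshold else right_mean)
            for x, p in zip(xs, predictions)
        ]
    return base, stumps


def predict(x, base, stumps, learning_rate):
    """Weighted sum of the initial constant and all weak learners."""
    return base + learning_rate * sum(
        left if x <= threshold else right for threshold, left, right in stumps
    )
```

With inputs `[1, 2, 3, 4]`, targets `[0, 0, 1, 1]`, and a learning rate of 0.5, each round halves the remaining residual, so the prediction for x = 1 moves from 0.5 toward 0 as rounds are added.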
Evaluation Metrics
A fraud detection system should be able to maximize the detection of fraudulent
transactions while minimizing the number of incorrectly predicted frauds (false positives). It
is often necessary to consider multiple measures to assess the overall performance of a fraud
detection system.
Confusion Matrix
A confusion matrix is a table that is used to describe the performance of a classification
algorithm. Each column of the matrix represents the actual class, while each row represents
the predicted class. There are four possible outcomes of a classifier: True positives (TP) are
instances of the positive class that are correctly predicted as positive by the classifier; True
negatives (TN) are instances of the negative class that are correctly predicted as negative;
False positives (FP) are instances of the negative class that are incorrectly predicted as
positive; False negatives (FN) are instances of the positive class that are incorrectly
predicted as negative [28]. In this study, fraudulent transactions are the positive class and
genuine transactions are the negative class. The confusion matrix is represented as follows:

Table 1: Confusion Matrix

Precision, Recall, and F1-Score
Precision and Recall are metrics derived from the confusion matrix, focusing on
ratios within the columns or rows of a confusion matrix. F1-score is defined as the harmonic
mean of the two quantities.
Precision (also known as Positive Predictive Value) measures the accuracy of positive
predictions made by a classifier. It is the ratio of correctly predicted positive instances to the
total number of predicted positive instances.

$$\text{Precision} = \frac{TP}{TP + FP} \tag{4}$$

Recall (also known as True Positive Rate or Sensitivity) measures the ability of a
classifier to find all positive instances within a dataset. It is the ratio of correctly predicted
positive instances to the total number of actual positive instances.
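$$\text{Recall (TPR)} = \frac{TP}{TP + FN} \tag{5}$$

The False Positive Rate (FPR) is the ratio of negative instances incorrectly predicted as
positive to the total number of actual negative instances.

$$\text{FPR} = \frac{FP}{TN + FP} \tag{6}$$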

The objectives of fraud detection models are to achieve both high precision and high
recall. However, these metrics are often inversely related. In practical applications, F1-Score
as a composite measure of precision and recall is more frequently employed.
$$\text{F1-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{7}$$
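The metrics in this section can be computed directly from the four confusion matrix counts. The sketch below assumes labels encoded as 1 for fraud and 0 for genuine, and returns 0.0 whenever a denominator is zero, which is a convention chosen here for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) with 1 as the positive (fraud) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred):
    """Precision, Recall, and F1-Score from the confusion matrix counts."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


def false_positive_rate(y_true, y_pred):
    """FPR = FP / (TN + FP)."""
    _, fp, _, tn = confusion_counts(y_true, y_pred)
    return fp / (tn + fp) if tn + fp else 0.0
```

As a worked example, with three actual frauds of which two are detected, plus one false alarm among five genuine transactions, TP = 2, FP = 1, FN = 1, and TN = 4, so precision, recall, and F1-Score all equal 2/3 and the FPR is 0.2.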
ROC Curve and AUC
The aforementioned metrics are contingent upon the classification threshold, which is
a decision boundary that separates the positive and negative classes. The default threshold is
typically set at 0.5. Variations in the classification threshold directly influence the calculated
values of these statistical metrics. In practical applications, the threshold is often calibrated
to align with specific risk management objectives [28].
The Receiver Operating Characteristic (ROC) curve is obtained by plotting TPR
against FPR for all the different classification thresholds. A perfect classifier will have a
ROC curve that hugs the top-left corner, with a TPR of 1 and an FPR of 0. The Area Under
the ROC Curve, called AUC, serves to evaluate the performance of a classifier over a
range of different thresholds.
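The construction of the ROC curve and the AUC can be sketched by sweeping the threshold over the observed scores and integrating with the trapezoidal rule. The sketch assumes that both classes are present in `y_true` and that higher scores indicate the positive class; the function names are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by lowering the threshold step by step."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```

For labels `[0, 0, 1, 1]` with scores `[0.1, 0.4, 0.35, 0.8]`, one genuine transaction is ranked above one fraud, and the AUC is 0.75. If every fraud is scored above every genuine transaction, the curve passes through the top-left corner and the AUC is 1.0.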

Figure 2: An example of ROC Curve

Chapter Three
Dataset
Credit card fraud transactions are characterized by low frequency but high impact,
making them a top priority for financial institutions' risk management. The more effective
fraud prevention measures are, the fewer fraudulent transactions accumulate, resulting in a
scarcity of fraud samples for research. Additionally, due to data security and regulatory
requirements, real-world financial data is difficult to obtain publicly, further hindering the
acquisition of research samples for credit card fraud.
Therefore, this study utilizes the publicly available dataset of transactions made by
European cardholders in September 2013, downloaded from Kaggle [11], which is commonly
used in credit card fraud research. The dataset covers transactions over two days and
contains 492 frauds out of a total of 284,807 transactions. The dataset is highly
unbalanced: the positive class accounts for 0.172% of all transactions [11]. This imbalanced
distribution reflects the real-world conditions that fraud detection must handle.
All 30 input variables are numerical with no missing or erroneous values. The feature
“Time” contains the seconds elapsed between each transaction and the first transaction in the
dataset. The feature “Amount” is the transaction amount [11]. Apart from these two
variables, the remaining 28 variables are the result of a PCA transformation applied because
of confidentiality concerns about the original information. Tables 2-3 and Figure 3 present
an outline of the dataset.

Table 2: The summary of the dataset

Figure 3: Feature Distribution

Table 3: Feature Importance Values

GANs Architecture and Hyperparameters
To build the GANs model for this research, the author chooses the vanilla GANs architecture
due to its simplicity and representativeness. Both the generator and the discriminator comprise
four dense layers. Each of the first three dense layers is followed by a dropout layer and a
batch normalization layer. The dropout layers help to prevent overfitting, while the batch
normalization layers speed up the training process. The last dense layer works as an output
layer with the sigmoid activation function. The Scaled Exponential Linear Unit (SELU) is
selected as the activation function for the generator because of its good properties in
maintaining gradient stability and enhancing generalization, which can help the generator
produce more diverse and high-quality samples. For the discriminator, the author uses the
Leaky Rectified Linear Unit (LeakyReLU) as the activation function because it allows a small
gradient for negative values, which helps to reduce the risk of overfitting. Notably, the
generator's training process does not involve direct interaction with the original training data.
It is guided solely by the feedback provided by the discriminator.
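The layer structure described above can be sketched in plain Python as a forward pass for a single sample. This is only an illustrative sketch under stated assumptions: the latent dimension, the hidden-layer widths, the LeakyReLU slope, and the weight initialization are hypothetical (the actual values are those in Table 4), and the dropout and batch normalization layers are omitted for brevity. All helper names (`build`, `forward`, and so on) are introduced here for illustration.

```python
import math
import random

# Standard SELU constants.
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def selu(x):
    """Scaled exponential linear unit (hidden layers of the generator)."""
    return SELU_SCALE * (x if x > 0 else SELU_ALPHA * (math.exp(x) - 1.0))


def leaky_relu(x, slope=0.2):
    """Leaky ReLU (hidden layers of the discriminator); the slope is assumed."""
    return x if x > 0 else slope * x


def sigmoid(x):
    """Numerically stable logistic function used by both output layers."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def init_dense(n_in, n_out, rng):
    """Random weights and zero biases for one dense layer (assumed initialization)."""
    limit = math.sqrt(1.0 / n_in)
    weights = [[rng.uniform(-limit, limit) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer, activation):
    """Apply one dense layer followed by an element-wise activation."""
    weights, biases = layer
    return [activation(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def build(sizes, rng):
    """Four dense layers need five sizes: input, three hidden widths, output."""
    return [init_dense(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def forward(x, layers, hidden_activation):
    """Three hidden layers with the given activation, then a sigmoid output layer."""
    for layer in layers[:-1]:
        x = dense(x, layer, hidden_activation)
    return dense(x, layers[-1], sigmoid)


rng = random.Random(2024)
LATENT_DIM, N_FEATURES = 16, 30   # LATENT_DIM is hypothetical; 30 input variables
generator = build([LATENT_DIM, 32, 64, 32, N_FEATURES], rng)   # widths assumed
discriminator = build([N_FEATURES, 64, 32, 16, 1], rng)        # widths assumed

noise = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
fake_sample = forward(noise, generator, selu)                   # one synthetic record
p_real = forward(fake_sample, discriminator, leaky_relu)[0]     # discriminator score
```

Because the generator ends in a sigmoid, every generated feature lies between 0 and 1, so a sketch like this presupposes that the real features are scaled to the same range before being shown to the discriminator.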
After 100 epochs of training (batch_size = 32), the losses of the generator and the
discriminator stabilize, and the generator is then used to augment the training set. Table 4
presents the hyperparameters of the generator and the discriminator, and Figure 4 presents
their losses during the training process.
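To help interpret loss curves of this kind, the sketch below computes the two losses of a vanilla GAN from discriminator outputs, assuming binary cross-entropy with the non-saturating generator loss and a discriminator loss formed as the sum of its real and fake terms. These choices and the function names are assumptions made for illustration; the loss configuration actually used in this study is not restated here.

```python
import math


def bce(prob, target, eps=1e-12):
    """Binary cross-entropy for one predicted probability and a 0/1 target."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1 - target) * math.log(1.0 - prob))


def mean(values):
    return sum(values) / len(values)


def discriminator_loss(d_real, d_fake):
    """Real samples are labelled 1 and generated samples 0."""
    return mean([bce(p, 1) for p in d_real]) + mean([bce(p, 0) for p in d_fake])


def generator_loss(d_fake):
    """Non-saturating loss: generated samples should be scored as real."""
    return mean([bce(p, 1) for p in d_fake])


# Early in training the discriminator separates the two sources easily.
early_d = discriminator_loss(d_real=[0.95, 0.90], d_fake=[0.05, 0.10])
early_g = generator_loss(d_fake=[0.05, 0.10])

# Near equilibrium the discriminator outputs about 0.5 for every sample.
stable_d = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
stable_g = generator_loss(d_fake=[0.5, 0.5])
```

Under these assumptions, a discriminator that can no longer tell real from generated samples yields a generator loss of ln 2 (about 0.693) and a discriminator loss of 2 ln 2 (about 1.386), which gives a reference point for judging whether the two losses have settled.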

Table 4: Hyperparameters of the Generator and the Discriminator

Figure 4: The losses of the Generator and the Discriminator

Experimental Procedure
Division of Dataset
In this study, the author first divides the dataset into a training dataset and a test dataset
at a ratio of 8:2 (Table 5). All data augmentation techniques are applied exclusively to the
training dataset, so that the test dataset remains completely unseen and serves as a
benchmark for evaluating the final performance of the model. This approach allows the author
to conduct comparative analyses of various data augmentation techniques, augmentation ratios,
and modeling methods under a unified evaluation standard.
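A minimal Python sketch of an 8:2 split is given below. It assumes a stratified split, in which each class is divided separately so that the rare fraud class appears in both parts; whether the split in this study was stratified is not stated, so this is an assumption. The function name `stratified_split` and the toy labels are hypothetical.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=2024):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 10 frauds among 1,000 transactions (illustrative, not the real data).
labels = [1] * 10 + [0] * 990
train_idx, test_idx = stratified_split(labels)

# Augmentation would be applied to the rows in train_idx only;
# the rows in test_idx stay untouched for the final evaluation.
```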

Table 5: The outline of training and testing datasets

Subsequently, after applying the different augmentation techniques to the training dataset,
the author splits off 10% of each augmented training dataset as a corresponding
validation dataset. The validation dataset serves a dual purpose: evaluating the model's
performance during training, and providing a reference against which the performance on the
test set is compared to assess generalization ability.
Data Augmentation
In this experiment, three data augmentation techniques, GANs, SMOTE, and
ADASYN, are employed to augment the training dataset by increasing the proportion of
fraud transaction samples to 10%, 30%, 50%, 80%, and 100% of the normal samples,
respectively (Figure 5 and Table 6). This allows the author to extend the research under
varying sample ratios. After data augmentation, 15 augmented training datasets are prepared.
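The number of synthetic fraud samples required at each ratio follows from simple arithmetic, sketched below in Python. The training-set counts used here are hypothetical round numbers chosen for illustration (the actual counts are those in Table 5 and Table 6), and the helper name `synthetic_needed` is introduced only for this example.

```python
from itertools import product

TECHNIQUES = ("GANs", "SMOTE", "ADASYN")
RATIOS = (0.1, 0.3, 0.5, 0.8, 1.0)   # fraud samples as a fraction of normal samples


def synthetic_needed(n_normal, n_fraud, ratio):
    """Number of synthetic fraud samples required to reach the target ratio."""
    target_fraud = round(n_normal * ratio)
    return max(target_fraud - n_fraud, 0)


# Hypothetical training-set counts, used only for illustration.
N_NORMAL, N_FRAUD = 200_000, 400

plan = {ratio: synthetic_needed(N_NORMAL, N_FRAUD, ratio) for ratio in RATIOS}

# Three techniques at five ratios give the 15 augmented training datasets.
datasets = list(product(TECHNIQUES, RATIOS))
```

With these illustrative counts, reaching a ratio of 10% requires 19,600 synthetic fraud samples, and a fully balanced dataset requires 199,600, so at every ratio the synthetic frauds far outnumber the real ones.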

Figure 5: Odds in a different group

Table 6: The outline of different datasets

Based on the degree of fraud sample augmentation (the ratio of positive to
negative samples), the entire experiment can be divided into five groups. Each group
contains four training datasets: three augmented training datasets obtained from the different
augmentation techniques, and the original training dataset for comparison. Figure 6 presents
the correlation matrices of the original training dataset and the three augmented balanced
training datasets.

Figure 6: Correlation Matrix

Models Construction
As described above, 10% of each augmented training dataset is reserved as a validation
dataset and excluded from model training. The remaining 90% is used to train models
with three machine learning algorithms: Logistic Regression, Random Forest, and XGB.
The performance of these models is evaluated on both the validation and test datasets.
Table 7 lists the hyperparameters of each classifier.

Table 7: Hyperparameters of each classifier

Chapter Four
In real-world credit card transaction practices, financial institutions need to both
effectively intercept fraudulent transactions to prevent financial losses and, at the same time,
ensure a good customer experience by minimizing false positives. This means striking a
balance between recall and precision. Therefore, this study employs F1-score, the harmonic
mean of recall and precision, as the primary evaluation metric for subsequent analysis.
Meanwhile, considering that the F1-Score is calculated based on a default classification
threshold of 0.5, AUC is chosen as a supplementary evaluation metric to explore the
classifiers’ performance under different thresholds.
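The dependence of Precision, Recall, and F1-Score on the classification threshold can be illustrated with a short Python sketch. The function name `precision_recall_f1` and the toy labels and probabilities are hypothetical and do not come from the experiments.

```python
def precision_recall_f1(y_true, y_prob, threshold=0.5):
    """Precision, recall and F1 for the positive (fraud) class at a given threshold."""
    tp = fp = fn = 0
    for y, p in zip(y_true, y_prob):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: 1 = fraud, 0 = legitimate (illustrative values only).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.6, 0.3, 0.7, 0.2, 0.1, 0.4, 0.05]

default = precision_recall_f1(y_true, y_prob)                   # threshold 0.5
lowered = precision_recall_f1(y_true, y_prob, threshold=0.25)   # more alerts raised
```

In this toy case, lowering the threshold from 0.5 to 0.25 raises recall from 2/3 to 1 and lowers precision from 2/3 to 0.6, which changes the F1-Score from 2/3 to 0.75. A threshold-independent metric such as AUC is therefore a useful complement.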
To ensure the robustness of the results, the entire experiment is repeated five times
with different random seeds (1952, 1980, 1986, 2013, and 2024), and the average results are
reported.
Validation Results Analysis
Tables 8-10 present the Precision, Recall, and F1-Score of validation datasets, and
Figure 7 shows the comparison analysis of F1-Score under different odds for data
augmentation.

Table 8: Precision of validation datasets

Table 9: Recall of validation datasets

Table 10: F1-Score of validation datasets

Figure 7: Validation F1-Score comparison under different odds

Based on the F1-Score, data augmentation generally improves model performance on the
validation datasets, regardless of the data augmentation technique employed, the degree to
which the proportion of fraud samples is increased, or the modeling method used.
Moreover, model performance tends to improve as the proportion of fraudulent samples
increases. The only exception is observed when ADASYN is combined with
Random Forest.
Additionally, among the three data augmentation techniques, GANs demonstrates the
best performance, regardless of the classification algorithms used. In contrast, ADASYN
yields the worst results.
From the perspective of classification algorithms, before data augmentation is applied
there is a significant difference in model performance among the three
classification algorithms: XGB performs the best, followed by Random Forest, which
significantly outperforms Logistic Regression. After data augmentation is applied,
Logistic Regression shows the most significant improvement in model
performance, and the gap between Random Forest and Logistic Regression is almost
eliminated.
Test Results Analysis
While the validation results indicate that data augmentation techniques
significantly improve model performance, it is important to note that the validation datasets
have also undergone data augmentation and therefore contain a large number of synthetic
samples. Strong performance on the validation dataset does not guarantee that the models
generalize to real-world unseen data. The effectiveness of a credit card fraud
detection model should be evaluated using real, unprocessed data. To evaluate the models’
performance in real-world scenarios, a 20% holdout is designated as the test dataset,
untouched by any data augmentation technique. This section details the evaluation
results on this pristine test dataset. Tables 11-13 present the Precision, Recall, and F1-Score
evaluated on the unique test dataset, and Figure 8 shows the comparison analysis of
F1-Score under different odds for data augmentation.

Table 11: Precision of the unique test dataset

Table 12: Recall of the unique test dataset

Table 13: F1-Score of the unique test dataset

Figure 8: Test F1-Score comparison under different odds

When evaluated on the test dataset, ADASYN exhibits the worst generalization
ability. All models trained on ADASYN-based datasets perform worse than the baseline
models (trained on the original training dataset). As for SMOTE, only Random Forest
outperforms the baseline, and only when the odds are 1:10. For both data augmentation
techniques, as the proportion of synthetic fraud samples increases further, the models’
generalization ability declines. Moreover, models built on SMOTE-based and
ADASYN-based datasets demonstrate a notable trend: the Precision decreases significantly
as the proportion of fraudulent samples increases. This suggests that these two data
augmentation techniques may lead to severe overfitting, as the synthetic samples are highly
similar to the original minority class samples. There seems to be an optimal balance between
the number of synthetic samples and the number of original minority class samples, beyond
which overfitting becomes more pronounced and worsens as more synthetic samples are added.
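The similarity between SMOTE-style synthetic samples and the original minority samples follows from how they are generated: each new point is an interpolation between a real minority sample and one of its nearest minority neighbours. The following Python sketch shows this core step in simplified form. It is not the implementation used in the experiments; the function names, the value of `k`, and the toy points are hypothetical, and ADASYN's adaptive weighting of hard-to-learn samples is not modelled.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote_like_samples(minority, n_new, k=2, seed=2024):
    """Generate n_new points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: sq_dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()   # interpolation weight in [0, 1)
        synthetic.append([b + lam * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class with two features (illustrative values only).
minority = [[1.0, 2.0], [1.5, 2.5], [2.0, 2.0], [3.0, 4.0]]
synthetic = smote_like_samples(minority, n_new=5)
```

Every generated point lies on a line segment between two existing fraud samples, so adding more of them densifies the region already covered by the real frauds without adding new patterns, which is consistent with the overfitting behaviour observed above.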

Figure 9: Comparative analysis chart of data augmentation ratios

In contrast, models trained on GANs-based datasets demonstrate superior
generalization ability. Both Logistic Regression and XGB models outperform the baseline
models, while Random Forest, although underperforming the baseline, still achieves
significantly better results than its counterparts trained on SMOTE-based and ADASYN-based
datasets. Another notable characteristic of GANs-based augmentation is its insensitivity to
the proportion of fraudulent samples. Unlike SMOTE and ADASYN, GANs does not tend to
overfit as the number of synthetic samples increases. Instead, the models demonstrate
remarkable stability across datasets with varying proportions of fraudulent samples.
ROC curves (Figure 10) and AUC (Table 14) further demonstrate the superiority of
using GANs for data augmentation. At all five augmentation ratios, GANs achieves AUC
scores higher than the baseline, while SMOTE and ADASYN fall below the baseline.

Table 14: AUC from XGB at five augmentation ratios

Figure 10: ROC Curves from XGB model

Furthermore, this experiment reveals that for GANs-based datasets, Logistic
Regression classifiers achieve the best performance on perfectly balanced datasets. In
contrast, for XGB classifiers, generating synthetic minority samples so that the minority
class amounts to only 10% of the majority class yields the optimal results. For Random Forest
classifiers, the best results are obtained in combination with SMOTE, with optimal
performance observed when the minority class constitutes 10% of the majority class.
Ultimately, a comparison of the results from the validation and test datasets demonstrates that
relying solely on augmented data for model evaluation can provide a distorted view of
performance. Taking the SMOTE experiment as an example, while SMOTE might appear
effective based on validation results, performance on real data reveals its limitations, with
only one out of fifteen experiments surpassing the baseline.

Chapter Five
Conclusions
This study investigates the application of GAN-based data augmentation in credit
card fraud detection scenarios compared with traditional SMOTE and ADASYN techniques.
To mitigate potential biases introduced by a single classification algorithm, three machine
learning algorithms are employed. This approach enables the exploration of the combined
effects of various data augmentation techniques and classification algorithms. Two types of
datasets are utilized to assess model performance by the metrics of Precision, Recall,
F1-score, and AUC-ROC. Validation datasets serve to evaluate individual models’ modeling
effectiveness, while the unique test dataset is used to assess the models’ performance in
real-world scenarios and provides a common benchmark for cross-model comparisons.
Experimental results reveal that while all three data augmentation techniques improve
modeling effectiveness to varying degrees, only GANs consistently enhances the models’
performance in real-world applications, regardless of the augmentation ratio, with Random
Forest being the one exception noted in Chapter Four. In contrast, SMOTE and ADASYN
demonstrate poor generalization ability: their effectiveness diminishes
as the proportion of synthetic fraud samples increases.
Further Research
GANs come in diverse architectures and loss functions, each with unique
characteristics. This study employs only the foundational vanilla GANs structure. Despite its
groundbreaking nature, the vanilla GANs suffers from several shortcomings, such as mode
collapse and vanishing gradients. Future research will explore more sophisticated GANs
variants to refine GANs performance and generate higher-quality synthetic data. Examples
include WGANs, which help to improve training stability and address vanishing
gradient problems, and TGANs, which incorporate pre-processing steps like normalization
and categorical embedding to handle mixed data types. Moreover, exploring the synergy between
GANs and other machine learning classification algorithms presents an avenue for further
investigation. To evaluate the approach more comprehensively, a broader range of
evaluation metrics will be considered in future studies.

References
[1] Strelcenia, E., & Prakoonwit, S. (2023). A survey on gan techniques for data
augmentation to address the imbalanced data issues in credit card fraud detection. Machine
Learning and Knowledge Extraction, 5(1), 304-329.
[2] Islam, M. A., Uddin, M. A., Aryal, S., & Stea, G. (2023). An ensemble learning approach
for anomaly detection in credit card data with imbalanced and overlapped classes. Journal of
Information Security and Applications, 78, 103618.
[3] Chargebacks911.com. 24 Key Credit Card Fraud Statistics to Know for 2024.
https://chargebacks911.com/credit-card-fraud-statistics/
[4] Cherif, A., Badhib, A., Ammar, H., Alshehri, S., Kalkatawi, M., & Imine, A. (2023).
Credit card fraud detection in the era of disruptive technologies: A systematic
review. Journal of King Saud University-Computer and Information Sciences, 35(1),
145-174.
[5] Lucas, Y., & Jurgovsky, J. (2020). Credit card fraud detection using machine learning: A
survey. arXiv preprint arXiv:2010.06479.
[6] Elkan, C. (2001, August). The foundations of cost-sensitive learning. In International
joint conference on artificial intelligence (Vol. 17, No. 1, pp. 973-978). Lawrence Erlbaum
Associates Ltd.
[7] Yeşilkanat, A., Bayram, B., Köroğlu, B., & Arslan, S. (2020). An adaptive approach on
credit card fraud detection using transaction aggregation and word embeddings. In Artificial
Intelligence Applications and Innovations: 16th IFIP WG 12.5 International Conference,
AIAI 2020, Neos Marmaras, Greece, June 5–7, 2020, Proceedings, Part I 16 (pp. 3-14).
Springer International Publishing.
[8] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., ... &
Bengio, Y. (2014). Generative adversarial nets. Advances in neural information processing
systems, 27.
[9] Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic
minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357.
[10] He, H., Bai, Y., Garcia, E. A., & Li, S. (2008, June). ADASYN: Adaptive synthetic
sampling approach for imbalanced learning. In 2008 IEEE international joint conference on
neural networks (IEEE world congress on computational intelligence) (pp. 1322-1328). IEEE.
[11] Kaggle: Credit Card Fraud Dataset
https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud

[12] Dornadula, V. N., & Geetha, S. (2019). Credit card fraud detection using machine
learning algorithms. Procedia computer science, 165, 631-641.
[13] Rtayli, N., & Enneya, N. (2020). Enhanced credit card fraud detection based on
SVM-recursive feature elimination and hyper-parameters optimization. Journal of
Information Security and Applications, 55, 102596.
[14] Tran, T. C., & Dang, T. K. (2021, January). Machine learning for prediction of
imbalanced data: Credit fraud detection. In 2021 15th International Conference on
Ubiquitous Information Management and Communication (IMCOM) (pp. 1-7). IEEE.
[15] Bagga, S., Goyal, A., Gupta, N., & Goyal, A. (2020). Credit card fraud detection using
pipeling and ensemble learning. Procedia Computer Science, 173, 104-112.
[16] Yang, W., Zhang, Y., Ye, K., Li, L., & Xu, C. Z. (2019). Ffd: A federated learning
based method for credit card fraud detection. In Big Data–BigData 2019: 8th International
Congress, Held as Part of the Services Conference Federation, SCF 2019, San Diego, CA,
USA, June 25–30, 2019, Proceedings 8 (pp. 18-32). Springer International Publishing.
[17] Ingole, S., Kumar, A., Prusti, D., & Rath, S. K. (2021). Service-based credit card fraud
detection using oracle SOA suite. SN Computer Science, 2, 1-9.
[18] Ngwenduna, K. S., & Mbuvha, R. (2021). Alleviating class imbalance in actuarial
applications using generative adversarial networks. Risks, 9(3), 49.
[19] Strelcenia, E., & Prakoonwit, S. (2023). Improving classification performance in credit
card fraud detection by using new data augmentation. AI, 4(1).
[20] Ba, H. (2019). Improving detection of credit card fraudulent transactions using
generative adversarial networks. arXiv preprint arXiv:1907.03355.
[21] Asha, R. B., & KR, S. K. (2021). Credit card fraud detection using artificial neural
network. Global Transitions Proceedings, 2(1), 35-41.
[22] Radford, A., Metz, L., & Chintala, S. (2015). Unsupervised representation learning with
deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
[23] Arjovsky, M., Chintala, S., & Bottou, L. (2017, July). Wasserstein generative
adversarial networks. In International conference on machine learning (pp. 214-223). PMLR.
[24] Mirza, M., & Osindero, S. (2014). Conditional generative adversarial nets. arXiv
preprint arXiv:1411.1784.
[25] Isangediok, M., & Gajamannage, K. (2022, December). Fraud detection using
optimized machine learning tools under imbalance classes. In 2022 IEEE International
Conference on Big Data (Big Data) (pp. 4275-4284). IEEE.

[26] Breiman, L. (2001). Random forests. Machine learning, 45, 5-32.
[27] Bin Sulaiman, R., Schetinin, V., & Sant, P. (2022). Review of machine learning
approach on credit card fraud detection. Human-Centric Intelligent Systems, 2(1), 55-68.
[28] Le Borgne, Y. A., Siblini, W., Lebichot, B., & Bontempi, G. (2022). Reproducible
machine learning for credit card fraud detection-practical handbook. Université Libre de
Bruxelles.
