GANS-BASED DATA AUGMENTATION IN CREDIT CARD FRAUD DETECTION

by

Qing Zhao

B.S., Tianjin University of Finance and Economics, 2009
M.S., Tianjin University of Finance and Economics, 2012

PROJECT SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE

UNIVERSITY OF NORTHERN BRITISH COLUMBIA

December 2024

© Qing Zhao, 2024

Abstract

The landscape of credit card fraud is evolving rapidly, with the emergence of increasingly sophisticated fraudulent methods. This trend has resulted in a significant uptick in financial losses incurred by both businesses and consumers. To address the challenge of credit card fraud detection, the industry has widely adopted machine learning models. However, building effective models is hindered by limited real-world data and the severe imbalance between fraudulent and legitimate transactions. In this work, I explore the application of Generative Adversarial Networks (GANs) to synthesizing fraudulent samples and examine their advantages over traditional data augmentation techniques. To mitigate potential biases introduced by a single modeling method, Logistic Regression (LR), Random Forest (RF), and Extreme Gradient Boosting (XGB) are employed, representing different modeling paradigms. My experiments show that models trained on the GANs-based synthetic data exhibit superior generalization capabilities and a stronger ability to discriminate between different classes.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Introduction
    Background
    Aims and Objectives
    Paper Structure
Chapter One
    Sampling-Based Researches
    GANs-Based Researches
Chapter Two
    Data Augmentation Techniques
        GANs
        SMOTE
        ADASYN
    Machine Learning Models
        Logistic Regression
        Random Forest
        Extreme Gradient Boosting
    Evaluation Metrics
        Confusion Matrix
        Precision, Recall, and F1-Score
        ROC Curve and AUC
Chapter Three
    Dataset
    GANs Architecture and Hyperparameters
    Experimental Procedure
        Division of Dataset
        Data Augmentation
        Models Construction
Chapter Four
    Validation Results Analysis
    Testing Results Analysis
Chapter Five
    Conclusions
    Further Research
Reference

List of Tables

Table 1: Confusion Matrix
Table 2: The summary of the dataset
Table 3: Feature Importance Values
Table 4: Hyperparameters of the Generator and the Discriminator
Table 5: The outline of training and testing datasets
Table 6: The outline of different datasets
Table 7: Hyperparameters of each classifier
Table 8: Precision of validation datasets
Table 9: Recall of validation datasets
Table 10: F1-Score of validation datasets
Table 11: Precision of the unique test dataset
Table 12: Recall of the unique test dataset
Table 13: F1-Score of the unique test dataset
Table 14: AUC from XGB at five augmentation ratios

List of Figures

Figure 1: GANs Architecture
Figure 2: An example of ROC Curve
Figure 3: Feature Distribution
Figure 4: The losses of the Generator and the Discriminator
Figure 5: Odds in a different group
Figure 6: Correlation Matrix
Figure 7: Validation F1-Score comparison under different odds
Figure 8: Test F1-Score comparison under different odds
Figure 9: Comparative analysis chart of data augmentation ratios
Figure 10: ROC Curves from XGB model

Introduction

Background

With rapid technological advancement and the growth of online commerce, credit card transactions have become a backbone of modern commerce, offering convenience and security for both consumers and businesses. At the same time, new credit card fraud techniques have made it difficult to detect fraud in time, leading to monetary losses [1]. According to Islam et al. (2023), global losses due to credit card fraud have tripled in the last decade, from $9.84 billion in 2011 to $32.34 billion in 2021. A report from [3] projects that global losses from credit card fraud will reach $43 billion by 2026, affecting cardholders, consumers, and merchants worldwide. In 2023 alone, over 1.03 million people were impacted by identity theft.

To address the challenge of credit card fraud detection, the industry has widely adopted machine learning models trained on historical transaction data. However, building effective models is hindered by limited real-world data, imbalanced datasets, data drift, and data overlap [4]. Class imbalance, where one class (fraudulent transactions) is significantly rarer than the other, is a prevalent issue in credit card fraud detection.

Researchers have primarily focused on two strategies to tackle class imbalance: algorithmic-level and data-level approaches [5]. Algorithmic-level strategies involve adapting algorithms to handle imbalanced data, such as cost-sensitive learning [6]. Data-level strategies aim to improve the distribution of minority class samples through data augmentation, a technique that artificially expands datasets by generating new training samples from existing data. This research focuses on data-level strategies, specifically data augmentation.
Data augmentation is crucial for enhancing the reliability and performance of machine learning models. By increasing the size and diversity of training data, data augmentation helps mitigate overfitting and improve model generalization. This is particularly valuable when real-world data is limited or difficult to acquire, as it reduces the need for time-consuming data collection and labeling processes [1].

Despite the availability of various sampling techniques for data augmentation, their effectiveness in handling imbalanced datasets with individual instance significance (one of the major properties of credit card fraud transactions) is limited [7]. Therefore, Generative Adversarial Networks (GANs) have drawn growing interest from researchers because of their robustness in addressing challenges such as class overlap and overfitting, which stems from their ability to capture intricate data structures [8].

Aims and Objectives

While GANs have demonstrated remarkable success in various domains, their application in finance, particularly credit card fraud detection, remains relatively unexplored. This research aims to evaluate the advantages of GAN-based data augmentation in this context. To comprehensively assess the effectiveness of GANs, the author compares their performance with two traditional sampling techniques, Synthetic Minority Over-sampling TEchnique (SMOTE) [9] and Adaptive Synthetic Sampling (ADASYN) [10], using a real-world dataset of European cardholders [11]. The experiments reveal that GANs can effectively generate diverse and realistic synthetic data, leading to significant improvements in model precision and robustness. To ensure a thorough evaluation, Logistic Regression, Random Forest, and XGB are employed, representing different modeling paradigms.

Paper Structure

The remainder of this paper is organized as follows: Chapter One reviews existing literature on data augmentation techniques.
Chapter Two details the research methodologies, including the experimental setup and evaluation metrics. Chapter Three depicts the architectural design of this research and describes the implementation procedure thoroughly. Chapter Four presents the experimental results, comparing the performance of different data augmentation techniques and machine learning models. Finally, Chapter Five summarizes the key findings and suggests potential areas for future research.

Chapter One

This section reviews data augmentation techniques for addressing class imbalance in credit card fraud detection. Both traditional sampling-based methods and the more recent GANs-based approaches are covered.

Sampling-Based Researches

Dornadula et al. (2019) [12] used SMOTE and one-class SVM to handle the imbalanced dataset, measuring performance with the Matthews correlation coefficient (MCC). They found that applying SMOTE improved the classifiers' performance, with the most pronounced improvement in logistic regression.

Rtayli et al. (2020) [13] proposed a hybrid model that combined recursive feature elimination, hyperparameter optimization, and SMOTE. After evaluating it on three large datasets, they concluded that their model performs well regardless of the dataset used.

Tran et al. (2021) [14] addressed the imbalanced dataset with the SMOTE and ADASYN techniques. They employed four machine learning algorithms and compared performance on the balanced datasets using fundamental, combined, and graphical assessments. They concluded that, for each ML algorithm, the fraud classification results obtained with the two data augmentation techniques were nearly the same.

However, the conclusions of [12][13][14] are all based on model training results. In their experiments, the authors did not separate the training dataset from the testing dataset.
Hence, the implications of these experiments for real-world scenarios are questionable.

Bagga et al. (2020) [15] used ADASYN to address class imbalance and applied nine classifiers to the rebalanced dataset. Their experiments demonstrated that even with ADASYN, the precision and F1-score for fraudulent transactions were still far lower than those for genuine transactions. However, pipelining and ensemble learning improved the overall performance.

Yang et al. (2019) [16] proposed a federated fraud detection framework that uses the SMOTE approach to construct a fraud detection model. The framework enables different banks to collaboratively learn a shared model while keeping their skewed training data in their own private databases. The experiment showed that federated learning achieves an average test AUC 10% higher than traditional systems.

Ingole et al. (2021) [17] used the sklearn.utils.resample utility to resample the fraudulent samples. They explored various sample sizes, finding that the best performance came when the fraudulent data was increased from 496 to 20,000 samples. However, finding the oversampling rate for larger datasets was tedious, as it required testing several sample sizes.

GANs-Based Researches

Ngwenduna et al. (2021) [18] provided a thorough introduction to GANs and their application in data augmentation. They conducted a comparative study of the performance of WCGAN against other resampling methods on five publicly available imbalanced datasets. Based mainly on AUC and AUPRC, they argued that WCGAN was significantly superior to SMOTE. Additionally, while SMOTE improved the AUC, it significantly compromised precision.

Strelcenia et al. (2023) [19] investigated a variety of data augmentation techniques to address the imbalanced data challenge, including the novel K-CGAN model. Six classifiers were employed to evaluate the performance of these techniques.
K-CGAN, B-SMOTE, and SMOTE consistently outperformed other methods in terms of precision and recall, while K-CGAN demonstrated the highest overall F1-Score. Additionally, the authors visualized the data points generated by K-CGAN to compare them with the original dataset, demonstrating the superior ability of GANs to synthesize realistic data. However, the authors' approach of splitting off the testing dataset after augmentation introduced a confounding factor. Comparing results across testing datasets that vary with the augmentation technique may not provide a reliable assessment of model performance. Furthermore, the high proportion of synthetic samples in the testing datasets could limit the evaluation of the model's ability to generalize to real-world data.

Ba, H. (2019) [20] built four types of GANs-based frameworks to augment the fraudulent samples and compared them with traditional resampling approaches. The experimental results showed that the GANs-based frameworks produced more balanced recall and precision, resulting in a better F1-Score. Additionally, the findings demonstrated that GAN-based augmentation outperformed the other approaches, improving generalization more notably than other training methods.

Asha et al. (2021) [21] proposed a novel framework that leveraged a Sparse Autoencoder and GANs to differentiate fraudulent from non-fraudulent credit card transactions. This model stood out as a one-class classification technique, eliminating the need for mixed-type datasets containing both positive and negative instances.

Chapter Two

Data Augmentation Techniques

GANs

Generative Adversarial Networks (GANs) were introduced in the landmark paper by Goodfellow et al. (2014) [8]. A GAN is a deep learning architecture that trains two neural networks, a Generator and a Discriminator, to compete against each other so that increasingly authentic new data can be generated from a given set of training samples. The generator network aims to create synthetic data samples.
It takes a random noise vector as input and transforms it into synthetic data. The discriminator network acts as a critic, trying to distinguish between real samples from the training data and the samples produced by the generator. Figure 1 shows a basic GANs architecture.

Figure 1: GANs Architecture

Training a GANs model involves an iterative competition between the generator and the discriminator. In the first phase, the generator creates random fake samples, and the discriminator receives both real data and these generated samples and tries to classify each as real or fake. In this phase, the discriminator is trained. In the second phase, the generator updates its weights based on the discriminator's feedback to improve its ability to fool the discriminator. During this phase, the generator is trained and the discriminator is not trainable. By alternating these two phases, the generator becomes better at creating synthetic data until it is hard for the discriminator to distinguish real from fake.

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{1}$$

The values x and z are sampled from the real data distribution and the noise distribution, respectively. Unlike traditional interpolation techniques, GANs generate new samples by learning the underlying data distribution, thereby enhancing sample diversity.

Although GANs are gaining popularity in many applications, they have notable issues. GANs are notoriously difficult to train properly and difficult to evaluate, and they suffer from the vanishing gradient problem, mode collapse, and boundary distortion [18].
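To make the objective in Equation (1) concrete, the following minimal sketch estimates V(D, G) from a batch of discriminator outputs. The function name `gan_value` and the sample probabilities are illustrative assumptions, not part of the implementation used in this study.

```python
import math


def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) in Equation (1).

    d_real: discriminator outputs D(x) on real samples, each in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1)
    (Hypothetical helper written for illustration only.)
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that cannot tell real from fake outputs 0.5 everywhere,
# so V = log(0.5) + log(0.5) = -2 * log(2).
undecided = gan_value([0.5, 0.5], [0.5, 0.5])

# A confident, correct discriminator pushes V toward its upper bound of 0.
confident = gan_value([0.99, 0.98], [0.01, 0.02])
```

The discriminator is trained to increase this value, while the generator is trained to decrease it by making D(G(z)) larger.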
Depending on the framework, loss function, and specific application, GANs have many variants, such as DCGANs [22], which leverage convolutional and deconvolutional layers; WGANs [23], which use the Wasserstein distance instead of the Jensen-Shannon divergence to improve training stability; and cGANs [24], which introduce additional information as input to both the generator and the discriminator.

SMOTE

Synthetic Minority Over-sampling TEchnique (SMOTE) was introduced in 2002 [9] and has become one of the most commonly used methods for overcoming class imbalance. The minority class is over-sampled by taking each minority class sample and introducing synthetic examples along the line segments joining any or all of its k nearest minority class neighbors. Depending on the amount of over-sampling required, neighbors are randomly chosen from the k nearest neighbors.

$$X_{new} = X + \mathrm{rand}(0, 1) \times (X_{old} - X) \tag{2}$$

Here X_old is a minority sample in the original dataset, X is one of its k nearest neighbors, and X_old − X is the line segment joining them. The artificial sample point X_new is placed at a random position along this segment.

While SMOTE excels in handling small datasets, its computational demands can be substantial for larger ones [18]. Moreover, its susceptibility to overfitting and noise can limit its effectiveness, especially when the minority class is inherently noisy or the synthetic data fails to accurately represent the underlying distribution. Furthermore, because SMOTE generates new samples without considering the labels of neighboring examples, it can produce class overlap, potentially hindering model performance.

ADASYN

Adaptive Synthetic Sampling (ADASYN) is a variant of SMOTE [10]. Unlike SMOTE, which generates synthetic minority class samples uniformly, ADASYN assigns weights to minority class instances based on how difficult they are to learn. Instances that are closer to the decision boundary will generate more synthetic samples.
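The contrast between SMOTE's interpolation in Equation (2) and ADASYN's difficulty-based weighting can be sketched in a few lines. This is a simplified illustration under stated assumptions: the helper names, the toy points, and the rounding of the per-sample counts are hypothetical, and the difficulty of a minority sample is taken to be the share of majority samples among its k nearest neighbors, as in the ADASYN formulation of [10].

```python
import math
import random


def smote_point(x_old, x_neighbor, rng):
    """Interpolate as in Equation (2): X_new = X + rand(0, 1) * (X_old - X)."""
    gap = rng.random()  # rand(0, 1)
    return [n + gap * (o - n) for o, n in zip(x_old, x_neighbor)]


def adasyn_allocation(minority, majority, k, total_new):
    """Number of synthetic samples to generate per minority point."""
    labeled = [(p, 1) for p in minority] + [(p, 0) for p in majority]
    ratios = []
    for x in minority:
        # k nearest neighbors of x in the full dataset, excluding x itself
        others = [(math.dist(x, p), label) for p, label in labeled if p is not x]
        others.sort(key=lambda pair: pair[0])
        majority_count = sum(1 for _, label in others[:k] if label == 0)
        ratios.append(majority_count / k)
    total = sum(ratios)
    if total == 0:  # assumption: fall back to a uniform split
        return [total_new // len(minority)] * len(minority)
    return [round(total_new * r / total) for r in ratios]


rng = random.Random(0)
minority = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0]]   # toy fraud samples
majority = [[1.1, 1.0], [1.0, 1.1], [5.0, 5.0]]   # toy genuine samples

# Only the third minority point borders the majority class, so it
# receives the whole budget of synthetic samples.
allocation = adasyn_allocation(minority, majority, k=2, total_new=10)

# One SMOTE-style sample between the first two minority points.
synthetic = smote_point(minority[0], minority[1], rng)
```

In this toy case SMOTE would spread new samples evenly over the three minority points, whereas the ADASYN-style allocation concentrates them on the borderline point.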
ADASYN is also applicable to multi-class imbalanced learning problems [19].

However, ADASYN can encounter challenges when dealing with sparsely distributed minority class instances. If each minority class instance has only a few neighbors within a specified radius, the generated synthetic samples might be overly concentrated in specific regions of the feature space, leading to inadequate representation of the underlying distribution of the minority class. In addition, by concentrating on instances close to the decision boundary, ADASYN might overrepresent these instances in the synthetic data. This can lead to a model that is overly sensitive to these borderline cases, potentially reducing precision.

Machine Learning Models

Logistic Regression

Logistic Regression (LR) is a statistical method for predicting the probability of a binary outcome based on one or more independent variables. Because of its simplicity, Logistic Regression is one of the most widely used learning algorithms for binary classification tasks. Its linear predictions are transformed into probabilities using the sigmoid function [25]. Let $x \in \mathbb{R}^m$ denote the input feature vector of length m. The response z is then the linear function $z = w \cdot x + b$, where w is the weight vector and b is the bias term, both estimated during training. The logistic function is given as

$$g(z) = \frac{1}{1 + e^{-z}}, \quad 0 < g(z) < 1 \tag{3}$$

The parameters of the logistic regression model are determined by maximum likelihood.

Compared with most machine learning methods, which have a black-box nature, Logistic Regression excels at interpretability. The coefficients of logistic regression can be interpreted as log-odds ratios, providing insights into the relationship between features and the outcome. Moreover, its rapid training and prediction capabilities make it well-suited for handling large datasets.

Random Forest

Random Forest (RF) is an ensemble learning method introduced in 2001 [26].
It constructs multiple decision trees and combines their predictions, through majority voting for classification tasks or mean prediction for regression tasks, to improve predictive accuracy and control overfitting. Random Forest introduces randomness at two stages: bootstrap sampling, where each decision tree is trained on a random subset of the dataset drawn with replacement, and random feature selection for each split in the decision tree. This technique makes the decision trees more diverse and helps to reduce correlation between individual trees.

The diversity among the trees reduces the model's sensitivity to noise and improves its performance on unseen data, even if there are missing data points. Moreover, Random Forest is effective in handling large and unbalanced datasets [27], and can provide an estimate of the importance of different features in the dataset.

Extreme Gradient Boosting

Extreme Gradient Boosting (XGB) is a scalable and efficient implementation of the gradient boosting algorithm. Gradient boosting improves the model sequentially by concentrating on the errors made by previous models: it starts with one weak learner and iteratively adds new weak learners to approximate functional gradients. In each iteration, the residual errors of the previous model are used to fit the next model. The final ensemble model is constructed as a weighted sum of all weak learners. Whereas “bagging” draws each data instance with equal probability, “boosting” places more emphasis in each iteration on the instances that the previously trained learners misclassified. “Bagging” mainly reduces variance and overfitting, while “boosting” mainly reduces bias and underfitting [25].

The scalability of XGB is attributed to its algorithmic optimizations and system-level enhancements. Although the boosting rounds themselves remain sequential, XGB parallelizes the construction of each tree, significantly accelerating training.
Furthermore, it incorporates tree pruning and regularization techniques to control model complexity and prevent overfitting. These combined features make XGB a leading choice for many machine learning tasks, offering a balance of prediction accuracy and computational efficiency.

Evaluation Metrics

A fraud detection system should maximize the detection of fraudulent transactions while minimizing the number of incorrectly predicted frauds (false positives). It is often necessary to consider multiple measures to assess the overall performance of a fraud detection system.

Confusion Matrix

A confusion matrix is a table used to describe the performance of a classification algorithm. Each column of the matrix represents the actual class, while each row represents the predicted class. There are four possible outcomes of a classifier: True positives (TP) are instances of the positive class that are correctly predicted as positive by the classifier; True negatives (TN) are instances of the negative class that are correctly predicted as negative; False positives (FP) are instances of the negative class that are incorrectly predicted as positive; False negatives (FN) are instances of the positive class that are incorrectly predicted as negative [28]. In this study, fraudulent transactions are the positive class and genuine transactions are the negative class. The confusion matrix is represented as follows:

Table 1: Confusion Matrix

Precision, Recall, and F1-Score

Precision and Recall are metrics derived from the confusion matrix, focusing on ratios within its columns or rows. F1-Score is defined as the harmonic mean of the two quantities.

Precision (also known as Positive Predictive Value) measures the accuracy of positive predictions made by a classifier. It is the ratio of correctly predicted positive instances to the total number of predicted positive instances.
$$\text{Precision} = \frac{TP}{TP + FP} \tag{4}$$

Recall (also known as True Positive Rate or Sensitivity) measures the ability of a classifier to find all positive instances within a dataset. It is the ratio of correctly predicted positive instances to the total number of actual positive instances. The False Positive Rate (FPR) is the corresponding ratio for the negative class: the share of actual negative instances that are incorrectly predicted as positive.

$$\text{Recall (TPR)} = \frac{TP}{TP + FN} \tag{5}$$

$$\text{FPR} = \frac{FP}{TN + FP} \tag{6}$$

The objective of a fraud detection model is to achieve both high precision and high recall. However, these metrics are often inversely related. In practical applications, the F1-Score, a composite measure of precision and recall, is more frequently employed.

$$\text{F1-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{7}$$

ROC Curve and AUC

The aforementioned metrics are contingent upon the classification threshold, which is a decision boundary that separates the positive and negative classes. The default threshold is typically set at 0.5. Variations in the classification threshold directly influence the calculated values of these metrics. In practical applications, the threshold is often calibrated to align with specific risk management objectives [28].

The Receiver Operating Characteristic (ROC) curve is obtained by plotting TPR against FPR for all the different classification thresholds. A perfect classifier has a ROC curve that hugs the top-left corner, with a TPR of 1 and an FPR of 0. The Area Under the ROC Curve, called AUC, serves to evaluate the performance of a classifier over a range of different thresholds.

Figure 2: An example of ROC Curve

Chapter Three

Dataset

Credit card fraud transactions are characterized by low frequency but high impact, making them a top priority for financial institutions' risk management. The more effective fraud prevention measures are, the fewer fraudulent transactions accumulate, resulting in a scarcity of fraud samples for research.
Additionally, due to data security and regulatory requirements, real-world financial data is difficult to obtain publicly, further hindering the acquisition of research samples for credit card fraud. Therefore, this study utilizes the publicly available dataset of European cardholders' transactions from September 2013, downloaded from Kaggle [11], which is commonly used in credit card fraud research.

The dataset covers transactions over two days, with 492 frauds out of a total of 284,807 transactions. The dataset is highly unbalanced: the positive class accounts for 0.172% of all transactions [11]. This imbalanced distribution reflects the real-world conditions that fraud detection must handle. All 30 input variables are numerical, with no missing or erroneous values. The feature “Time” contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature “Amount” is the transaction amount [11]. Apart from these two variables, the remaining 28 variables are the result of a PCA transformation, applied because of confidentiality concerns about the original information. Tables 2-3 and Figure 3 present the outline of the dataset.

Table 2: The summary of the dataset

Figure 3: Feature Distribution

Table 3: Feature Importance Values

GANs Architecture and Hyperparameters

To build the GANs model for this research, the author chooses the vanilla GANs architecture because of its simplicity and representativeness. Both the generator and the discriminator consist of four dense layers. Each of the first three dense layers is followed by a dropout layer and a batch normalization layer. The dropout layers are added to prevent overfitting, while the batch normalization layers speed up the training process. The last dense layer works as an output layer with the sigmoid activation function.
The Scaled Exponential Linear Unit (SELU) is selected as the activation function for the generator because of its good properties in maintaining gradient stability and enhancing generalization, which can help the generator produce more diverse and higher-quality samples. For the discriminator, the author uses the Leaky Rectified Linear Unit (LeakyReLU) as the activation function because it allows a small gradient for negative inputs, which avoids inactive ("dying") units and helps keep training stable. Notably, the generator's training process does not involve direct interaction with the original training data. It is guided solely by the feedback provided by the discriminator. After 100 epochs of training (batch_size=32), the losses of the generator and the discriminator tend to stabilize, and the generator is then used to augment the training set. Table 4 and Figure 4 present the hyperparameters of the generator and the discriminator and their losses during the training process.

Table 4: Hyperparameters of the Generator and the Discriminator

Figure 4: The losses of the Generator and the Discriminator

Experimental Procedure

Division of Dataset

In this study, the author initially divides the dataset into a training dataset and a test dataset at a ratio of 8:2 (Table 5). All data augmentation techniques are applied exclusively to the training dataset so that the test dataset remains completely unseen and serves as a benchmark for evaluating the final performance of the models. This approach allows the author to conduct comparative analyses of various data augmentation techniques, augmentation ratios, and modeling methods under a unified evaluation standard.

Table 5: The outline of training and testing datasets

Subsequently, after applying the different augmentation techniques to the training dataset, the author further splits off 10% of each augmented training dataset as a corresponding validation dataset.
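
The two-stage partition described here could be sketched as follows. This is a minimal illustration under stated assumptions, not the author's code: the helper `stratified_split`, the toy label list, and the seed are hypothetical, and splitting each class separately (stratification) is an assumed design choice that the text does not state. The augmentation step itself is omitted.

```python
import random


def stratified_split(labels, holdout_fraction, seed):
    """Return (kept_indices, holdout_indices), sampling each class separately."""
    rng = random.Random(seed)
    kept, holdout = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_holdout = round(len(idx) * holdout_fraction)
        holdout.extend(idx[:n_holdout])
        kept.extend(idx[n_holdout:])
    return sorted(kept), sorted(holdout)


# Toy labels (hypothetical): 1 = fraud, 0 = normal; 1,000 rows with 20 frauds.
labels = [1] * 20 + [0] * 980

# Step 1: 8:2 split into training and test sets; the test set is never augmented.
train_idx, test_idx = stratified_split(labels, holdout_fraction=0.2, seed=2024)

# Step 2: augmentation of the training rows would happen here (omitted).

# Step 3: hold out 10% of the (augmented) training set for validation.
train_labels = [labels[i] for i in train_idx]
fit_pos, val_pos = stratified_split(train_labels, holdout_fraction=0.1, seed=2024)

print(len(train_idx), len(test_idx), len(fit_pos), len(val_pos))
```

Because the test indices are fixed before any augmentation, every augmented variant of the training set can be scored against the same untouched rows.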
The validation dataset serves a dual purpose: evaluating the model's performance during training and providing a point of comparison with the performance on the test set to assess generalization ability.

Data Augmentation

In this experiment, three data augmentation techniques, GANs, SMOTE, and ADASYN, are employed to augment the training dataset by increasing the number of fraud transaction samples to 10%, 30%, 50%, 80%, and 100% of the number of normal samples, respectively (Figure 5 and Table 6). This allows the author to extend the research to varying sample ratios. After data augmentation, 15 augmented training datasets are prepared (three techniques at five ratios).

Figure 5: Odds in a different group

Table 6: The outline of different datasets

Based on the degree of fraud sample augmentation (different ratios of positive to negative samples), the entire experiment can be divided into five groups. Each group contains four training datasets: three augmented training datasets obtained from the different augmentation techniques, and the original training dataset for comparison. Figure 6 presents the correlation matrices of the original training dataset and the three augmented balanced training datasets.

Figure 6: Correlation Matrix

Models Construction

As described above, 10% of each augmented training dataset is reserved as a validation dataset and excluded from model training. The remaining 90% is used to train models with three machine learning algorithms: Logistic Regression, Random Forest, and XGB. The performance of these models is evaluated on both the validation and test datasets. Table 7 lists the hyperparameters of each classifier.

Table 7: Hyperparameters of each classifier

Chapter Four

In real-world credit card transaction practice, financial institutions need both to intercept fraudulent transactions effectively to prevent financial losses and to ensure a good customer experience by minimizing false positives. This means striking a balance between recall and precision.
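
The balance between recall and precision can be made concrete with a small sketch. The labels and scores below are invented for illustration, and the helper name `precision_recall_f1` is hypothetical; the formulas follow Equations (4), (5), and (7).

```python
def precision_recall_f1(labels, scores, threshold):
    """Compute Precision, Recall, and F1-Score at a given classification threshold."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 0)
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical fraud probabilities from a classifier (1 = fraud, 0 = normal).
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.55, 0.30, 0.60, 0.40, 0.20, 0.10, 0.05, 0.02]

for threshold in (0.25, 0.50, 0.75):
    p, r, f1 = precision_recall_f1(labels, scores, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

On this toy data, lowering the threshold to 0.25 catches every fraud but flags two normal transactions, while raising it to 0.75 removes the false alarms but misses half of the frauds.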
Therefore, this study employs the F1-Score, the harmonic mean of recall and precision, as the primary evaluation metric for the subsequent analysis. Because the F1-Score is calculated at the default classification threshold of 0.5, AUC is chosen as a supplementary evaluation metric to examine the classifiers’ performance across different thresholds. To ensure the robustness of the results, the entire experiment is repeated five times with different random seeds (1952, 1980, 1986, 2013, and 2024), and the average results are reported.

Validation Results Analysis

Tables 8-10 present the Precision, Recall, and F1-Score on the validation datasets, and Figure 7 compares the F1-Score under different odds for data augmentation.

Table 8: Precision of validation datasets

Table 9: Recall of validation datasets

Table 10: F1-Score of validation datasets

Figure 7: Validation F1-Score comparison under different odds

Based on the F1-Score, data augmentation generally improves model performance regardless of the data augmentation technique employed, the degree to which the proportion of fraud samples is increased, or the modeling method used. Moreover, model performance tends to improve as the proportion of fraudulent samples increases. The only exception is observed when ADASYN is combined with Random Forest. Additionally, among the three data augmentation techniques, GANs demonstrates the best performance regardless of the classification algorithm used, whereas ADASYN yields the worst results. From the perspective of classification algorithms, before data augmentation is applied, model performance differs significantly among the three classification algorithms: XGB performs the best, followed by Random Forest, which significantly outperforms Logistic Regression.
However, after data augmentation is applied, Logistic Regression shows the most significant improvement in model performance, and the gap between Random Forest and Logistic Regression is almost eliminated.

Test Results Analysis

While the validation results indicate that data augmentation techniques significantly improve model performance, it is important to note that the validation datasets have also undergone data augmentation and therefore contain a large number of synthetic samples. Strong performance on the validation dataset does not guarantee the models’ ability to generalize to real-world unseen data. The effectiveness of a credit card fraud detection model should be evaluated using real, unprocessed data. To evaluate the models’ performance in real-world scenarios, a 20% holdout is designated as the test dataset, untouched by any data augmentation technique. This section details the evaluation results on this pristine test dataset. Tables 11-13 present the Precision, Recall, and F1-Score evaluated on the test dataset, and Figure 8 compares the F1-Score under different odds for data augmentation.

Table 11: Precision of the unique test dataset

Table 12: Recall of the unique test dataset

Table 13: F1-Score of the unique test dataset

Figure 8: Test F1-Score comparison under different odds

When evaluated on the test dataset, ADASYN exhibits the worst generalization ability. All models trained on ADASYN-based datasets perform worse than the baseline models (trained on the original training dataset). As for SMOTE, only Random Forest outperforms the baseline, and only when the odds are 1:10. For both data augmentation techniques, as the proportion of synthetic fraud samples increases further, the models’ generalization ability declines.
Moreover, models built on SMOTE-based and ADASYN-based datasets show a notable trend: Precision decreases significantly as the proportion of fraudulent samples increases. This suggests that these two data augmentation techniques may lead to severe overfitting, as the synthetic samples are highly similar to the original minority class samples. There seems to be an optimal balance between the number of synthetic samples and the number of original minority class samples, beyond which overfitting becomes more pronounced and worsens as more synthetic samples are added.

Figure 9: Comparative analysis chart of data augmentation ratios

In contrast, models trained on GANs-based datasets demonstrate superior generalization ability. Both the Logistic Regression and XGB models outperform the baseline models, while Random Forest, although underperforming the baseline, still achieves significantly better results than with SMOTE-based and ADASYN-based datasets. Another notable characteristic of GANs-based augmentation is its insensitivity to the proportion of fraudulent samples. Unlike SMOTE and ADASYN, GANs does not tend to overfit as the number of synthetic samples increases. Instead, the models demonstrate remarkable stability across datasets with varying proportions of fraudulent samples. The ROC curves (Figure 10) and AUC values (Table 14) further demonstrate the superiority of using GANs for data augmentation. At all five augmentation ratios, GANs achieves AUC scores higher than the baseline, while SMOTE and ADASYN fall below the baseline.

Table 14: AUC from XGB at five augmentation ratios

Figure 10: ROC Curves from XGB model

Furthermore, this experiment reveals that for GANs-based datasets, Logistic Regression classifiers achieve the best performance on perfectly balanced datasets. In contrast, for XGB classifiers, generating synthetic minority samples so that the minority class amounts to only 10% of the majority class yields the optimal results.
For Random Forest classifiers, the best practice is found to be a combination with SMOTE, with optimal performance observed when the minority class constitutes 10% of the majority class. Ultimately, a comparison of the results from the validation and test datasets demonstrates that relying solely on augmented data for model evaluation can provide a distorted view of performance. Taking the SMOTE experiment as an example, while SMOTE appears effective based on the validation results, performance on real data reveals its limitations, with only one out of fifteen experiments surpassing the baseline.

Chapter Five

Conclusions

This study investigates the application of GANs-based data augmentation in credit card fraud detection scenarios compared with the traditional SMOTE and ADASYN techniques. To mitigate potential biases introduced by a single classification algorithm, three machine learning algorithms are employed. This approach enables the exploration of the combined effects of various data augmentation techniques and classification algorithms. Two types of datasets are used to assess model performance with the metrics of Precision, Recall, F1-Score, and AUC. Validation datasets serve to evaluate individual models’ modeling effectiveness, while the unique test dataset is used to assess the models’ performance in real-world scenarios and provides a common benchmark for cross-model comparisons. Experimental results reveal that while all three data augmentation techniques improve modeling effectiveness to varying degrees, only GANs consistently enhances the models’ performance in real-world applications, regardless of the augmentation ratio. In contrast, SMOTE and ADASYN demonstrate poor generalization ability; their effectiveness diminishes as the proportion of synthetic fraud samples increases.

Further Research

GANs exhibits diverse architectures and loss functions, each with unique characteristics. This study employs only the foundational Vanilla GANs structure.
Despite its groundbreaking nature, Vanilla GANs suffers from several shortcomings, such as mode collapse and vanishing gradients. Future research will examine more sophisticated GANs variants, such as WGANs, which helps to improve training stability and address vanishing gradient problems, and TGANs, which incorporates pre-processing steps like normalization and categorical embedding to handle mixed data types, in order to refine GANs performance and generate higher-quality synthetic data. Moreover, exploring the synergy between GANs and other machine learning classification algorithms presents an avenue for further investigation. To evaluate performance more comprehensively, a broader range of evaluation metrics will be considered in future studies.

Reference

[1] Strelcenia, E., & Prakoonwit, S. (2023). A survey on GAN techniques for data augmentation to address the imbalanced data issues in credit card fraud detection. Machine Learning and Knowledge Extraction, 5(1), 304-329.
[2] Islam, M. A., Uddin, M. A., Aryal, S., & Stea, G. (2023). An ensemble learning approach for anomaly detection in credit card data with imbalanced and overlapped classes. Journal of Information Security and Applications, 78, 103618.
[3] Chargebacks911.com. 24 Key Credit Card Fraud Statistics to Know for 2024. https://chargebacks911.com/credit-card-fraud-statistics/
[4] Cherif, A., Badhib, A., Ammar, H., Alshehri, S., Kalkatawi, M., & Imine, A. (2023). Credit card fraud detection in the era of disruptive technologies: A systematic review. Journal of King Saud University-Computer and Information Sciences, 35(1), 145-174.
[5] Lucas, Y., & Jurgovsky, J. (2020). Credit card fraud detection using machine learning: A survey. arXiv preprint arXiv:2010.06479.
[6] Elkan, C. (2001, August). The foundations of cost-sensitive learning. In International Joint Conference on Artificial Intelligence (Vol. 17, No. 1, pp. 973-978). Lawrence Erlbaum Associates Ltd.
[7] Yeşilkanat, A., Bayram, B., Köroğlu, B., & Arslan, S. (2020). An adaptive approach on credit card fraud detection using transaction aggregation and word embeddings. In Artificial Intelligence Applications and Innovations: 16th IFIP WG 12.5 International Conference, AIAI 2020, Neos Marmaras, Greece, June 5–7, 2020, Proceedings, Part I 16 (pp. 3-14). Springer International Publishing.
[8] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., ... & Bengio, Y. (2014). Generative adversarial nets. Advances in Neural Information Processing Systems, 27.
[9] Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357.
[10] He, H., Bai, Y., Garcia, E. A., & Li, S. (2008, June). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence) (pp. 1322-1328). IEEE.
[11] Kaggle: Credit Card Fraud Dataset. https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud
[12] Dornadula, V. N., & Geetha, S. (2019). Credit card fraud detection using machine learning algorithms. Procedia Computer Science, 165, 631-641.
[13] Rtayli, N., & Enneya, N. (2020). Enhanced credit card fraud detection based on SVM-recursive feature elimination and hyper-parameters optimization. Journal of Information Security and Applications, 55, 102596.
[14] Tran, T. C., & Dang, T. K. (2021, January). Machine learning for prediction of imbalanced data: Credit fraud detection. In 2021 15th International Conference on Ubiquitous Information Management and Communication (IMCOM) (pp. 1-7). IEEE.
[15] Bagga, S., Goyal, A., Gupta, N., & Goyal, A. (2020). Credit card fraud detection using pipeling and ensemble learning. Procedia Computer Science, 173, 104-112.
[16] Yang, W., Zhang, Y., Ye, K., Li, L., & Xu, C. Z. (2019). Ffd: A federated learning based method for credit card fraud detection.
In Big Data–BigData 2019: 8th International Congress, Held as Part of the Services Conference Federation, SCF 2019, San Diego, CA, USA, June 25–30, 2019, Proceedings 8 (pp. 18-32). Springer International Publishing.
[17] Ingole, S., Kumar, A., Prusti, D., & Rath, S. K. (2021). Service-based credit card fraud detection using oracle SOA suite. SN Computer Science, 2, 1-9.
[18] Ngwenduna, K. S., & Mbuvha, R. (2021). Alleviating class imbalance in actuarial applications using generative adversarial networks. Risks, 9(3), 49.
[19] Strelcenia, E., & Prakoonwit, S. (2023). Improving classification performance in credit card fraud detection by using new data augmentation. AI, 4(1).
[20] Ba, H. (2019). Improving detection of credit card fraudulent transactions using generative adversarial networks. arXiv preprint arXiv:1907.03355.
[21] Asha, R. B., & KR, S. K. (2021). Credit card fraud detection using artificial neural network. Global Transitions Proceedings, 2(1), 35-41.
[22] Radford, A., Metz, L., & Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
[23] Arjovsky, M., Chintala, S., & Bottou, L. (2017, July). Wasserstein generative adversarial networks. In International Conference on Machine Learning (pp. 214-223). PMLR.
[24] Mirza, M., & Osindero, S. (2014). Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
[25] Isangediok, M., & Gajamannage, K. (2022, December). Fraud detection using optimized machine learning tools under imbalance classes. In 2022 IEEE International Conference on Big Data (Big Data) (pp. 4275-4284). IEEE.
[26] Breiman, L. (2001). Random forests. Machine Learning, 45, 5-32.
[27] Bin Sulaiman, R., Schetinin, V., & Sant, P. (2022). Review of machine learning approach on credit card fraud detection. Human-Centric Intelligent Systems, 2(1), 55-68.
[28] Le Borgne, Y. A., Siblini, W., Lebichot, B., & Bontempi, G. (2022).
Reproducible machine learning for credit card fraud detection-practical handbook. Université Libre de Bruxelles.