# Credit Card Fraud Detection Using Multivariate Gaussians

* Source: https://www.kaggle.com/code/hrao768/gaussian-distrib-for-anomaly-detection-f1-83/notebook
* Accessed: Jan 15, 2025
* Modified as needed

To train and evaluate the Gaussian distribution algorithms, we split the data into training, cross-validation and test sets using the ratios below.

    1) Train: 60% of the genuine records (y=0) and no fraud records (y=1), so the training set needs no label.
    
    2) CV: 20% of the genuine records (y=0) and 50% of the fraud records (y=1)
    
    3) Test: the remaining 20% of the genuine records (y=0) and the remaining 50% of the fraud records (y=1)



Procedure for anomaly detection:

    1) Fit the model p(x) on training set
    
    2) On cross validation/test data, predict
    
        y = 1 if p(x) < epsilon (anomaly)
        
        y = 0 if p(x) >= epsilon (normal)
        
    3) We use cross validation to choose the parameter epsilon, using precision, recall and F1-score as evaluation metrics.
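
The three steps above can be sketched on synthetic data. This is a minimal illustration, not the notebook's actual pipeline: the data, the quantile-based candidate thresholds and all variable names here are assumptions made for the example.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: normal points near the origin,
# anomalies shifted far away from it.
X_train = rng.normal(0.0, 1.0, size=(1000, 2))            # normal records only
X_cv = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),    # normal records
                  rng.normal(6.0, 1.0, size=(20, 2))])    # anomalous records
y_cv = np.concatenate([np.zeros(200, dtype=int), np.ones(20, dtype=int)])

# 1) Fit p(x) on the training set.
mu = X_train.mean(axis=0)
cov = np.cov(X_train, rowvar=False)

# 2) Score the cross-validation set.
p_cv = multivariate_normal.pdf(X_cv, mean=mu, cov=cov)

# 3) Pick the epsilon that maximizes F1 on the cross-validation set.
#    Candidates are taken from quantiles of the observed densities (an assumption).
candidates = np.quantile(p_cv, np.linspace(0.01, 0.5, 50))
scores = [f1_score(y_cv, (p_cv < eps).astype(int)) for eps in candidates]
best_eps = candidates[int(np.argmax(scores))]
best_f1 = max(scores)
print("best epsilon:", best_eps, "F1:", best_f1)
```

Because the synthetic anomalies sit far from the training data, the best F1 here is high; on the real dataset the classes overlap much more.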
    


We could use either of two Gaussian distribution models for anomaly detection.

    1) Gaussian (Normal) Distribution - the normal distribution is parametrized by the mean and the variance of each feature.
    
    2) Multivariate Normal Distribution - the probability density function of multivariate_normal is parametrized by the mean vector and the covariance matrix.

Algorithm Selection:

    1) For this dataset we use the multivariate normal probability density function, since its covariance matrix captures the correlations between variables when computing the probabilities, so we don't need to derive new features by hand. Because the features are the output of PCA, it is difficult for us to interpret the relationships between them ourselves.

    2) However, the multivariate normal probability density function is computationally more expensive than the univariate Gaussian one. On very large datasets we might prefer per-feature univariate Gaussians to speed up the process, and use feature engineering to capture relationships between features.
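
To illustrate the point about correlations, here is a small sketch on hypothetical data with two strongly correlated features. The data, the correlation of 0.9 and the helper names `independent_pdf` and `joint_pdf` are assumptions for the example, not part of the original notebook.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(0)

# Two strongly correlated features (hypothetical data, correlation 0.9).
true_cov = np.array([[1.0, 0.9],
                     [0.9, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=true_cov, size=5000)

mu = X.mean(axis=0)
sigma = np.cov(X, rowvar=False)

def independent_pdf(points):
    # Product of per-feature 1-D Gaussians: ignores correlation.
    stds = np.sqrt(np.diag(sigma))
    return np.prod(norm.pdf(points, loc=mu, scale=stds), axis=1)

def joint_pdf(points):
    # Full covariance matrix: accounts for correlation.
    return multivariate_normal.pdf(points, mean=mu, cov=sigma)

points = np.array([[1.5, 1.5],     # follows the correlation
                   [1.5, -1.5]])   # breaks the correlation

p_ind = independent_pdf(points)
p_joint = joint_pdf(points)

# Density of the correlation-breaking point relative to the typical one.
ratio_independent = p_ind[1] / p_ind[0]
ratio_joint = p_joint[1] / p_joint[0]
print("independent model ratio:", ratio_independent)
print("multivariate model ratio:", ratio_joint)
```

The independent model scores both points almost equally, while the multivariate model assigns the correlation-breaking point a far lower density, which is the behavior we want from an anomaly detector.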


Feature Selection:

    1) The features we choose for these algorithms should be approximately normally distributed. Otherwise we need to transform them toward a normal distribution using log, sqrt, etc.

    2) Choose features that might take on unusually large or small values in the event of an anomaly. We inspect the per-feature distributions using distplot, so it is wise to choose features whose distribution for fraud records is clearly different from that for genuine records.
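
As a sketch of the first point, a skewed feature can be checked and transformed like this. The log-normal `amount` values are synthetic stand-ins, and using `scipy.stats.skew` as the normality check is an assumption of this example rather than something the notebook does.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Hypothetical right-skewed feature, e.g. a transaction amount.
amount = rng.lognormal(mean=3.0, sigma=1.0, size=10_000)

# log1p(x) = log(1 + x) also handles zero values safely.
log_amount = np.log1p(amount)

skew_before = skew(amount)
skew_after = skew(log_amount)
print("skewness before:", skew_before)
print("skewness after log transform:", skew_after)
```

A skewness close to zero after the transform suggests the feature is now closer to symmetric, and so a better fit for a Gaussian model.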

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import matplotlib.gridspec as gridspec
import seaborn as sns

from sklearn.base import BaseEstimator,TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split,cross_val_predict,cross_val_score, GridSearchCV,RandomizedSearchCV
from sklearn.preprocessing import StandardScaler,MinMaxScaler
from sklearn.metrics import confusion_matrix,classification_report,f1_score,recall_score,precision_score,accuracy_score,precision_recall_curve,roc_curve,roc_auc_score

from collections import Counter

from scipy.stats import norm, multivariate_normal

plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

import warnings
warnings.filterwarnings('ignore')

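import random
random.seed(0)
np.random.seed(0)  # pandas and scikit-learn draw from NumPy's global RNG when no random_state is given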

In [2]:
def Print_Accuracy_Scores(y,y_pred):
    print("F1 Score: ", f1_score(y,y_pred))
    print("Precision Score: ", precision_score(y,y_pred))
    print("Recall Score: ", recall_score(y,y_pred))

In [5]:
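#Loading Dataset
# UNCOMMENT ONE

# Note: this is a GitHub "blob" page URL, which serves an HTML page; pd.read_csv needs a link to the raw CSV file
url = "https://github.com/AET-CS/aet-cs.github.io/blob/main/white/ML/data/creditcard.csv"

# cc_dataset = pd.read_csv(url)  # the path or URL is the first positional argument; read_csv has no url= keyword
cc_dataset = pd.read_csv("../data/creditcard.csv")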


In [4]:
cc_dataset.shape


In [None]:
cc_dataset.head()

In [None]:
cc_dataset.describe()

In [None]:
#Code for checking if any feature has null values. Here the output confirms that there are no null values in this data set.
cc_dataset.isnull().any()

In [None]:
#Counts for each class in the dataset. Only 492 (0.17%) of the 284807 records are fraud cases; the remaining 284315 (99.83%) are genuine.
#So the dataset is clearly imbalanced!
cc_dataset['Class'].value_counts()

In [None]:
#Data visualization: compare the distribution of genuine and fraud cases for each feature
v_features = cc_dataset.columns.drop('Class')  # 'Class' is the label, not a feature to plot
plt.figure(figsize=(12,len(v_features)*4))
gs = gridspec.GridSpec(len(v_features),1)

for i, col in enumerate(v_features):
    ax = plt.subplot(gs[i])
    # Note: sns.distplot is deprecated in recent seaborn releases (histplot/kdeplot replace it)
    sns.distplot(cc_dataset[col][cc_dataset['Class']==0],color='g',label='Genuine Class')
    sns.distplot(cc_dataset[col][cc_dataset['Class']==1],color='r',label='Fraud Class')
    ax.legend()
plt.show()

Feature selection: 
    1) The distribution of anomalous transactions (class = 1) closely matches the distribution of genuine transactions (class = 0) for the features 'V28','V27','V26','V25','V24','V23','V22','V20','V15','V13','V8'. It is better to drop these features, as they are unlikely to help in finding anomalous records.
    2) Time is also not a useful variable, since it contains the seconds elapsed between each transaction and the first transaction in the dataset, so its values simply increase through the data.
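
The visual comparison above could also be backed by a number. The sketch below uses the two-sample Kolmogorov-Smirnov statistic from `scipy.stats.ks_2samp` to measure how far apart the two class distributions are for a feature. This is a swapped-in technique, not what the notebook does, and the feature names `same` and `shifted` and their data are hypothetical.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical features: "same" has identical class distributions,
# "shifted" has a clearly different distribution for fraud records.
genuine = {"same": rng.normal(0, 1, 5000), "shifted": rng.normal(0, 1, 5000)}
fraud = {"same": rng.normal(0, 1, 300), "shifted": rng.normal(-4, 2, 300)}

# KS statistic: 0 means identical empirical distributions, 1 means fully separated.
ks = {name: ks_2samp(genuine[name], fraud[name]).statistic for name in genuine}

for name, stat in ks.items():
    print(name, round(stat, 3))
```

Features with a statistic near zero would be candidates for dropping, in the same spirit as the manual selection above.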

In [None]:
cc_dataset.drop(labels = ['V28','V27','V26','V25','V24','V23','V22','V20','V15','V13','V8','Time'], axis = 1, inplace=True)
cc_dataset.columns

The features below do not have the same distribution for genuine and fraud records, but the distribution for fraud records is not unusual either.
So I'll drop these features as well, since features that behave unusually for fraud records are the most useful for an anomaly detection algorithm.

In [None]:
cc_dataset.drop(labels = ['V1','V2','V5','V6','V7','V21','Amount'], axis = 1, inplace=True)
cc_dataset.columns

In [None]:
#Visualization to understand the relationship between features and also data pattern using pair plot from seaborn
# cc_subset = cc_dataset.sample(frac=0.001)
# g = sns.pairplot(cc_subset,hue="Class",diag_kind='kde')

(Plot omitted. It takes a long time and doesn't reveal much.)

There is not much insight from the pair plot, except that most of the features show a clear separation between fraud and genuine records. In the diagonal KDE plots, the distribution of fraud records is quite different from that of genuine records. All the features look approximately normally distributed, so we can train the multivariate Gaussian distribution algorithm on the original features.

In [None]:
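#Method for selecting epsilon with best F1-score
def SelectThresholdByCV_Anomaly(probs,y):
    """Scan candidate thresholds, plot precision/recall, and return the epsilon with the best F1-score."""
    best_epsilon = 0
    best_f1 = 0
    best_recall = 0
    best_precision = 0
    
    #epsilons = sorted(np.unique(probs))
    #print(epsilons)
    # Candidate thresholds; the scaled densities in probs are assumed to lie in [0, 1)
    epsilons = np.arange(0,1,0.01)
    
    precisions=[]
    recalls=[]
    for epsilon in epsilons:
        predictions = (probs < epsilon)
        # zero_division=0 avoids undefined-metric warnings when nothing is flagged (e.g. epsilon = 0)
        f = f1_score(y, predictions, zero_division=0)
        precision = precision_score(y, predictions, zero_division=0)
        recall = recall_score(y, predictions, zero_division=0)
        #print("Threshold {0},Precision {1},Recall {2}".format(epsilon,precision,recall))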
          
        if f > best_f1:
            best_f1 = f
            best_precision = precision
            best_recall = recall
            best_epsilon = epsilon
        
        precisions.append(precision)
        recalls.append(recall)

    #Precision-Recall Trade-off
    plt.plot(epsilons,precisions,label='Precision')
    plt.plot(epsilons,recalls,label='Recall')
    plt.xlabel("Epsilon")
    plt.title('Precision Recall Trade Off')
    plt.legend()
    plt.show()

    print ('Best F1 Score %f' %best_f1)
    print ('Associated Precision Score %f' %best_precision)
    print ('Associated Recall Score %f' %best_recall)
    print ('Associated Epsilon', best_epsilon)
    return best_epsilon

In [None]:
#Method for calculating parameters Mu & Co-variance
def estimateGaussian(data):
    mu = np.mean(data,axis=0)
    sigma = np.cov(data.T)
    return mu,sigma
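
For reference, the quantities returned by estimateGaussian and the density they feed into can be written as follows, for m training rows x^(i) with n features each. Note that np.cov divides by m - 1 by default (the unbiased estimate), which is what the formula for Sigma below assumes.

```latex
\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}, \qquad
\Sigma = \frac{1}{m-1}\sum_{i=1}^{m}
         \left(x^{(i)}-\mu\right)\left(x^{(i)}-\mu\right)^{\top}

p(x;\mu,\Sigma) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}}
\exp\!\left(-\tfrac{1}{2}\,(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right)
```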

In [None]:
#Method for implementing multivariate gaussian distribution pdf, scaled
def MultivariateGaussianDistribution(data,mu,sigma):
    p = multivariate_normal.pdf(data, mean=mu, cov=sigma)
    p_transformed = np.power(p,1/100) #transformed the probability scores by p^1/100 since the values are very low (up to e-150)
    return p_transformed
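
The p^(1/100) scaling helps with readability, but very small densities can still underflow to exactly zero before the power is applied. A possible alternative, sketched here under assumed toy parameters (a 10-dimensional standard normal and a deliberately distant point), is to work with `multivariate_normal.logpdf` directly.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Toy parameters (assumptions for illustration only).
dim = 10
mu = np.zeros(dim)
sigma = np.eye(dim)

# A point very far from the mean, so its density is extremely small.
far_point = np.full(dim, 15.0)

p = multivariate_normal.pdf(far_point, mean=mu, cov=sigma)
log_p = multivariate_normal.logpdf(far_point, mean=mu, cov=sigma)

# log(p ** (1/100)) equals log_p / 100, computed without ever forming p.
log_p_scaled = log_p / 100

print("pdf:", p)                # underflows to 0.0 in double precision
print("logpdf:", log_p)         # still finite
print("scaled log-density:", log_p_scaled)
```

Since the logarithm is monotonic, comparing the log-density against log(epsilon) ranks records the same way as comparing p against epsilon, while keeping extreme outliers distinguishable from one another.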

In [None]:
genuine_data = cc_dataset[cc_dataset['Class']==0]
fraud_data = cc_dataset[cc_dataset['Class']==1]

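# optionally reduce data for speed (frac=1 keeps every row and only shuffles)
genuine_data = genuine_data.sample(frac=1, random_state=42)
fraud_data = fraud_data.sample(frac=1.0, random_state=42)  # fixed seed so the splits are reproducible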

In [None]:
#Split Genuine records into train & test - 60:40 ratio
genuine_train,genuine_test = train_test_split(genuine_data,test_size=0.4,random_state=0)
print(genuine_train.shape)
print(genuine_test.shape)

In [None]:
#Split 40% of Genuine Test records into Cross Validation & Test again (50:50 ratio)
genuine_cv,genuine_test = train_test_split(genuine_test,test_size=0.5,random_state=0)
print(genuine_cv.shape)
print(genuine_test.shape)

In [None]:
#Split Fraud records into Cross Validation & Test (50:50 ratio)
fraud_cv,fraud_test = train_test_split(fraud_data,test_size=0.5,random_state=0)
print(fraud_cv.shape)
print(fraud_test.shape)

In [None]:
#Drop Y-label from Train data
train_data = genuine_train.drop(labels='Class',axis=1)
print(train_data.shape)

In [None]:
#Cross validation data
cv_data = pd.concat([genuine_cv,fraud_cv])
cv_data_y = cv_data['Class']
cv_data.drop(labels='Class',axis=1,inplace=True)
print(cv_data.shape)

In [None]:
#Test data
test_data = pd.concat([genuine_test,fraud_test])
test_data_y = test_data['Class']
test_data.drop(labels='Class',axis=1,inplace=True)
print(test_data.shape)

In [None]:
#StandardScaler - Feature scaling is skipped: the PCA features are already centered near zero, and the multivariate Gaussian accounts for each feature's scale through the covariance matrix
#sc = StandardScaler()
#train_data = sc.fit_transform(train_data)
#cv_data = sc.transform(cv_data)
#test_data = sc.transform(test_data)

In [None]:
#Find out the parameters Mu and Covariance for passing to the probability density function
mu,sigma = estimateGaussian(train_data)

In [None]:
mu

In [None]:
#Multivariate Gaussian distribution - This calculates the (scaled) probability density for each record.
p_train = MultivariateGaussianDistribution(train_data,mu,sigma)
print(p_train.mean())
print(p_train.std())
print(p_train.max())
print(p_train.min())

In [None]:
p_train

In [None]:
cv_data.shape, mu.shape

In [None]:
#Calculate the probabilities for cross validation and test records by passing the mean and co-variance matrix derived from train data
p_cv = MultivariateGaussianDistribution(cv_data,mu,sigma)
p_test = MultivariateGaussianDistribution(test_data,mu,sigma)

In [None]:
print(p_cv.mean())
print(p_cv.std())
print(p_cv.max())
print(p_cv.min())

In [None]:
#Calculate the probabilities for the fraud records in the cross validation set only, to compare them with the overall CV statistics above
pf_cv = MultivariateGaussianDistribution(fraud_cv.drop('Class',axis=1),mu,sigma)

In [None]:
print(pf_cv.mean())
print(pf_cv.std())
print(pf_cv.max())
print(pf_cv.min())

In [None]:
#Let us use cross validation to find the best threshold where the F1 -score is high
eps_optimal = SelectThresholdByCV_Anomaly(p_cv,cv_data_y)

In [None]:
#CV data - Predictions
pred_cv= (p_cv < eps_optimal)
Print_Accuracy_Scores(cv_data_y, pred_cv)

In [None]:
#Confusion matrix on CV
cnf_matrix = confusion_matrix(cv_data_y,pred_cv)
row_sum = cnf_matrix.sum(axis=1,keepdims=True)
cnf_matrix_norm =cnf_matrix / row_sum 
sns.heatmap(cnf_matrix_norm,cmap='YlGnBu',annot=True);
plt.title("Normalized Confusion Matrix - Cross Validation");
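
The row normalization above can be checked on a tiny example. The labels below are made up for illustration. As far as I know, `confusion_matrix` also accepts `normalize="true"` (scikit-learn 0.22 and later), which should give the same row-normalized matrix.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels: 4 genuine (0) and 4 fraud (1) records.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 0])

# Manual row normalization, as in the cell above.
cm = confusion_matrix(y_true, y_pred)
manual = cm / cm.sum(axis=1, keepdims=True)

# Built-in row normalization.
builtin = confusion_matrix(y_true, y_pred, normalize="true")

# Row 1 is the fraud class: [false negative rate, recall].
false_negative_rate = builtin[1, 0]
recall = builtin[1, 1]
print(builtin)
```

In the row-normalized matrix, the bottom-left cell is the false negative rate and the bottom-right cell is the recall, so the two always sum to one.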

(The numbers in this paragraph and after will vary from run to run.)

Notice that false negatives are around 24%. I tried to reduce false negatives and improve recall by increasing epsilon. I succeeded in bringing recall above 80%, but precision dropped to 70% pretty quickly. Hence I decided to choose the epsilon with the best F1-score, i.e. 0.2425.
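
The trade-off described above can also be explored by fixing a target recall and looking for the most precise threshold that reaches it. The sketch below does this with `precision_recall_curve` on synthetic scores. The beta-distributed stand-ins for the scaled densities and the 0.80 target are assumptions for the example.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)

# Synthetic stand-ins for scaled densities: genuine records score high, fraud low.
p_genuine = rng.beta(8, 2, size=2000)
p_fraud = rng.beta(2, 5, size=100)
p = np.concatenate([p_genuine, p_fraud])
y = np.concatenate([np.zeros(2000, dtype=int), np.ones(100, dtype=int)])

# precision_recall_curve expects higher scores for the positive class,
# so pass -p: a low density means a high anomaly score.
precision, recall, thresholds = precision_recall_curve(y, -p)

# The final (precision=1, recall=0) point has no threshold; drop it.
precision, recall = precision[:-1], recall[:-1]

# Among thresholds that reach the target recall, keep the most precise one.
target_recall = 0.80
ok = recall >= target_recall
best = int(np.argmax(np.where(ok, precision, -1.0)))
epsilon = -thresholds[best]

# A record is flagged when its score >= threshold, i.e. when p <= epsilon.
pred = p <= epsilon
achieved_recall = pred[y == 1].mean()
achieved_precision = y[pred].mean()
print("epsilon:", epsilon)
print("recall:", achieved_recall, "precision:", achieved_precision)
```

Whether the precision at that recall is acceptable is a business decision; the F1-based choice in this notebook is one reasonable compromise.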

In [None]:
#Test data - Check the F1-score by using the best threshold from cross validation
pred_test = (p_test < eps_optimal)
Print_Accuracy_Scores(test_data_y,pred_test)

In [None]:
cnf_matrix = confusion_matrix(test_data_y, pred_test)
row_sum = cnf_matrix.sum(axis=1,keepdims=True)
cnf_matrix_norm =cnf_matrix / row_sum 
sns.heatmap(cnf_matrix_norm,cmap='YlGnBu',annot=True)
plt.title("Normalized Confusion Matrix - Test data")

Conclusion: The anomaly detection algorithm gave decent results, with an F1-score of about 0.83. We could improve recall, and thus the F1-score, further by deriving new features based on business knowledge. However, since the features are PCA outputs, we could not interpret them well enough to do feature engineering.