# Credit Card Fraud Detection Using SMOTE

* Source: https://www.kaggle.com/code/hrao768/anomaly-detection-using-smote
* Accessed Jan 16, 2025
* Modified as needed

This is the 2nd approach I'm sharing for credit card fraud detection. Refer to my earlier kernel @ https://www.kaggle.com/hrao768/gaussian-distrib-for-anomaly-detection-f1-83

We are going to explore resampling techniques, specifically oversampling, in this 2nd approach. Here are the key steps involved in this kernel.

    1) Balance the dataset by oversampling fraud class records using SMOTE
    
    2) Train a Random Forest model on the oversampled data
    
    3) Evaluate the performance of this model based on predictions on the original imbalanced test data
        
    4) Add cluster segments to the original train and test data using the K-Means algorithm
    
    5) Repeat steps 1, 2 & 3 and see if the performance of Random Forest has improved by adding clusters
    
    6) Finally, evaluate the model and check whether it generalizes well to unseen data using K-fold cross validation on the original train data
    

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import matplotlib.gridspec as gridspec
import seaborn as sns

from sklearn.base import BaseEstimator,TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split,cross_val_predict,cross_val_score, GridSearchCV,RandomizedSearchCV
from sklearn.preprocessing import StandardScaler,MinMaxScaler
from sklearn.metrics import confusion_matrix,classification_report,f1_score,recall_score,precision_score,accuracy_score,precision_recall_curve,roc_curve,roc_auc_score

from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.cluster import KMeans
from imblearn.over_sampling import SMOTE

from collections import Counter

plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

import warnings
warnings.filterwarnings('ignore')

import random
random.seed(0)

In [None]:
def data_preparation(data):
    features = data.iloc[:,0:-1]
    label = data.iloc[:,-1]
    x_train,x_test,y_train,y_test = train_test_split(features,label,test_size=0.2,random_state=0)

    #StandardScaler is not applied here since the V1-V28 features are outputs of PCA and are already standardized.
    #sc = StandardScaler()
    #x_train = sc.fit_transform(x_train)
    #x_test = sc.transform(x_test)
    
    print("Length of training data",len(x_train))
    print("Length of test data",len(x_test))
    return x_train,x_test,y_train,y_test
    

In [None]:
def build_model_train_test(model,x_train,x_test,y_train,y_test):
    model.fit(x_train,y_train)

    y_pred = model.predict(x_train)
    
    print("\n----------Scores on Train data------------------------------------")
    print("F1 Score: ", f1_score(y_train,y_pred))
    print("Precision Score: ", precision_score(y_train,y_pred))
    print("Recall Score: ", recall_score(y_train,y_pred))


    print("\n----------Scores on Test data------------------------------------")
    y_pred_test = model.predict(x_test)
    
    print("F1 Score: ", f1_score(y_test,y_pred_test))
    print("Precision Score: ", precision_score(y_test,y_pred_test))
    print("Recall Score: ", recall_score(y_test,y_pred_test))

    #Confusion Matrix
    plt.figure(figsize=(18,6))
    gs = gridspec.GridSpec(1,2)

    ax1 = plt.subplot(gs[0])
    cnf_matrix = confusion_matrix(y_train,y_pred)
    row_sum = cnf_matrix.sum(axis=1,keepdims=True)
    cnf_matrix_norm =cnf_matrix / row_sum
    sns.heatmap(cnf_matrix_norm,cmap='YlGnBu',annot=True)
    plt.title("Normalized Confusion Matrix - Train Data")

    ax2 = plt.subplot(gs[1])
    cnf_matrix = confusion_matrix(y_test,y_pred_test)
    row_sum = cnf_matrix.sum(axis=1,keepdims=True)
    cnf_matrix_norm =cnf_matrix / row_sum
    sns.heatmap(cnf_matrix_norm,cmap='YlGnBu',annot=True)
    plt.title("Normalized Confusion Matrix - Test Data")


In [None]:
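#Loading Dataset
# UNCOMMENT ONE of the two read_csv lines below

# Note: a GitHub "blob" URL serves an HTML page, not the CSV itself;
# appending "?raw=true" asks GitHub to redirect to the raw file.
url = "https://github.com/AET-CS/aet-cs.github.io/blob/main/white/ML/data/creditcard.csv?raw=true"

# cc_dataset = pd.read_csv(url)  # the path/URL is the first positional argument; read_csv has no `url` keyword
# cc_dataset = pd.read_csv("../data/creditcard.csv")

# Optional shrink for speed (fixed random_state so the sample is reproducible)
cc_dataset = cc_dataset.sample(frac = 0.2, random_state = 0)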

In [None]:
cc_dataset.shape

In [None]:
cc_dataset.head()

In [None]:
cc_dataset.describe()

In [None]:
#Code for checking if any feature has null values. Here the output confirms that there are no null values in this data set.
cc_dataset.isnull().any()

In [None]:
#Counts for each class in the dataset. The full dataset has only 492 (0.17%) fraud cases out of 284807 records;
#the remaining 284315 (99.83%) records are genuine cases. So the dataset is clearly imbalanced!
#If the optional 20% sample above was applied, the counts printed here will be roughly one fifth of these.
cc_dataset['Class'].value_counts()
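
The percentages quoted in the comment above follow directly from the class counts. As a quick check, the sketch below recomputes them from the counts stated there (492 fraud cases out of 284807 records, i.e. the full dataset before any sampling). It also shows why plain accuracy is a misleading metric here: a hypothetical classifier that labels every transaction as genuine would still reach about 99.8% accuracy while catching no fraud at all.

```python
# Class counts quoted in the comment above (full dataset, before any sampling).
n_total = 284807
n_fraud = 492
n_genuine = n_total - n_fraud

fraud_pct = 100 * n_fraud / n_total
genuine_pct = 100 * n_genuine / n_total
print(f"Fraud:   {n_fraud} ({fraud_pct:.2f}%)")
print(f"Genuine: {n_genuine} ({genuine_pct:.2f}%)")

# Hypothetical baseline that labels every transaction as genuine:
# its accuracy equals the genuine share, but it finds none of the fraud cases.
baseline_accuracy = n_genuine / n_total
baseline_recall = 0 / n_fraud
print(f"Always-genuine accuracy: {baseline_accuracy:.4f}, fraud recall: {baseline_recall:.1f}")
```

This is why the rest of the notebook reports F1, precision and recall for the fraud class rather than accuracy.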

## Feature Selection

In [None]:
SKIP_CELL = True  # Change to False to run the cell

if not SKIP_CELL:
    print("This cell is running!")
    #Data Visualization for checking the distribution for Genuine cases & Fraud cases for each feature
    v_features = cc_dataset.columns
    plt.figure(figsize=(12,31*4))
    gs = gridspec.GridSpec(31,1)
    
    for i, col in enumerate(v_features):
        ax = plt.subplot(gs[i])
        sns.distplot(cc_dataset[col][cc_dataset['Class']==0],color='g',label='Genuine Class')
        sns.distplot(cc_dataset[col][cc_dataset['Class']==1],color='r',label='Fraud Class')
        ax.legend()
    plt.show()
else:
    print("Cell skipped (same plots as the other notebook.)")
    


Feature selection: 
    1) The distribution of anomalous transactions (class = 1) closely matches the distribution of genuine transactions (class = 0) for the features 'V28', 'V27', 'V26', 'V25', 'V24', 'V23', 'V22', 'V20', 'V15', 'V13' and 'V8'. These features could be dropped, as they may not be useful in finding anomalous records.
    2) 'Time' is also not a useful variable, since it contains the seconds elapsed between each transaction and the first transaction in the dataset, so it simply increases through the data.
    
Let us remove only the feature 'Time' for now and build the model.

In [None]:
cc_dataset.drop(labels = ['Time'], axis = 1, inplace=True)

The feature 'Amount' has a high standard deviation of about 250, which indicates a wide spread and suggests that there may be outliers in the data. So let us scale the 'Amount' variable using StandardScaler().

In [None]:
cc_dataset['Amount'] = StandardScaler().fit_transform(cc_dataset[['Amount']])
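
To make the effect of StandardScaler concrete, here is a small sketch on made-up amounts (the values are illustrative, not taken from the dataset). It checks that the scaler's output matches the formula (x - mean) / std, where std is the population standard deviation. Note that the one very large value still stands out after scaling: standardization rescales outliers but does not remove them.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative transaction amounts (made up), including one large outlier.
amounts = np.array([[1.0], [5.0], [20.0], [100.0], [2500.0]])

scaled = StandardScaler().fit_transform(amounts)

# StandardScaler computes (x - mean) / std using the population std (ddof=0),
# which is also NumPy's default for .std().
manual = (amounts - amounts.mean()) / amounts.std()

print("Matches manual formula:", np.allclose(scaled, manual))
print("Scaled values:", scaled.ravel().round(3))
```

In a stricter setup the scaler would be fit on the training split only and then applied to the test split, so that no test-set statistics leak into training.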

In [None]:
#Data Preparation
x_train,x_test,y_train,y_test = data_preparation(cc_dataset)

imbalanced-learn is a Python package offering a number of re-sampling techniques commonly used on datasets showing strong between-class imbalance. It is compatible with scikit-learn and is part of the scikit-learn-contrib projects.

imbalanced-learn is available on PyPI and can be installed via pip: pip install -U imbalanced-learn

I'm going to use the Synthetic Minority Oversampling Technique (SMOTE) to balance the dataset here.

In [None]:
from imblearn.over_sampling import SMOTE
from collections import Counter

# Initialize SMOTE
os = SMOTE(random_state=0)

# Generate the oversampled data
os_res_x, os_res_y = os.fit_resample(x_train, y_train)

# Counts of each class in oversampled data
print(sorted(Counter(os_res_y).items()))


We can see that SMOTE has added synthetic fraud records until their count matches that of the genuine records in the oversampled data. Both classes are now equally represented.
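
SMOTE creates each synthetic record by picking a minority-class sample, choosing one of its nearest minority-class neighbours, and placing a new point at a random position on the line segment between the two. The following is a minimal NumPy sketch of that interpolation step only, on made-up 2-D points. The function name `smote_like_samples` and its parameters are hypothetical; the actual imbalanced-learn implementation handles many more details.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (illustration only)."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples.
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)           # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, minority.shape[1]))
    for j in range(n_new):
        i = rng.integers(len(minority))       # random minority sample
        nn = neighbours[i, rng.integers(k)]   # one of its k nearest neighbours
        gap = rng.random()                    # position along the segment, in [0, 1)
        synthetic[j] = minority[i] + gap * (minority[nn] - minority[i])
    return synthetic

# Made-up minority-class points in 2-D.
fraud_points = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [3.0, 3.0]])
new_points = smote_like_samples(fraud_points, n_new=5)
print(new_points.round(3))
```

Because every synthetic point lies between two existing minority points, the new records stay inside the region already occupied by the fraud class instead of being exact duplicates.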

In [None]:
#RandomForest for training over-sampled data set. 
rnd_clf = RandomForestClassifier(n_estimators=100,criterion='gini',n_jobs=-1, random_state=0)
#Train the model on oversampled data and check the performance on original test data
build_model_train_test(rnd_clf,os_res_x,x_test,os_res_y,y_test)

RandomForest gave good results after balancing the training data with synthetic oversampling: an F1 score of about 0.85 (85%) on the original test data (without oversampling).
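
For reference, the F1 score reported above is the harmonic mean of precision and recall for the fraud class. The sketch below works through the arithmetic on hypothetical confusion-matrix counts; the numbers are invented for illustration and are not results from this notebook.

```python
# Hypothetical counts for the fraud class on a test set (illustrative only).
tp = 80   # fraud cases correctly flagged
fp = 10   # genuine transactions wrongly flagged as fraud
fn = 20   # fraud cases that were missed

precision = tp / (tp + fp)   # share of flagged transactions that are really fraud
recall = tp / (tp + fn)      # share of fraud cases that were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Since the harmonic mean is pulled toward the smaller of the two values, a model cannot reach a high F1 by maximizing precision or recall alone.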

I'd like to try K-means clustering to identify the clusters in the dataset, which could improve the predictive power in fraud detection.

In [None]:
# #Elbow Curve for identifying the best number of clusters
# wcss = [] # Within Cluster Sum of Squares
# for k in range(1, 21):
#     kmeans = KMeans(n_clusters = k, init = 'k-means++', random_state = 0)
#     kmeans.fit(x_train)
#     wcss.append(kmeans.inertia_)
# plt.plot(range(1, 21), wcss)
# plt.title('The Elbow Method')
# plt.xlabel('Number of clusters - k')
# plt.ylabel('WCSS')
# plt.show()

In [None]:
#Clustering with 11 clusters. I used the elbow method to arrive at the number of clusters; the elbow code above is commented out to save run time.
kmeans_best = KMeans(n_clusters = 11, init = 'k-means++', random_state = 0)
train_clusters = kmeans_best.fit_predict(x_train)
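
The elbow method mentioned in the comment above fits K-Means for a range of k and looks for the point where the within-cluster sum of squares (WCSS, exposed by scikit-learn as `inertia_`) stops dropping sharply. Here is a small sketch on synthetic blobs, assuming three well-separated made-up clusters rather than the credit card data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated clusters (not the credit card data).
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=0)

ks = range(1, 7)
wcss = []
for k in ks:
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    km.fit(X)
    wcss.append(km.inertia_)

# Reduction in WCSS gained by each additional cluster;
# the "elbow" is the k after which these reductions become small.
drops = -np.diff(wcss)
for k, w in zip(ks, wcss):
    print(k, round(w, 1))
```

On this toy data the WCSS falls steeply up to k = 3 and only marginally afterwards, which is the pattern to look for. On real data the elbow is usually less pronounced, so the choice of k involves some judgment.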

In [None]:
#Merge clusters with other input features on Train Data
x_train2 = np.c_[(x_train,train_clusters )]
x_train2.shape

In [None]:
#Predict the cluster for test data & merge it with other features
test_clusters = kmeans_best.predict(x_test)
x_test2 = np.c_[(x_test,test_clusters )]
x_test2.shape

In [None]:
#Generate the oversample data for training purpose
os_res_x2,os_res_y2=os.fit_resample(x_train2,y_train)
#Counts of each class in oversampled data
print(sorted(Counter(os_res_y2).items()))


In [None]:
#RandomForest for training over-sampled data set. 
rnd_clf2 = RandomForestClassifier(n_estimators=100,criterion='gini',n_jobs=-1, random_state=0)
#Train the model on oversampled data and check the performance on actual test data
build_model_train_test(rnd_clf2,os_res_x2,x_test2,os_res_y2,y_test)


After adding the clusters to the dataset, the performance of the RandomForest model improved slightly: an F1 score of about 0.87 and a recall of about 0.85 on the original test data (without oversampling).
Let us check the consistency of this model by using cross validation scores based on the original train data.

In [None]:
#Let us check cross validation scores on the original train data
cv_score = cross_val_score(rnd_clf2,x_train2,y_train,cv=5,scoring='f1')
print("Average F1 score CV",cv_score.mean())


In [None]:
cv_score = cross_val_score(rnd_clf2,x_train2,y_train,cv=5,scoring='recall')
print("Average Recall score CV",cv_score.mean())


Under cross validation, the recall score has gone down a little. However, the overall F1 score is still around 0.85, so we can go ahead with this model.

Conclusion: On imbalanced datasets, training on data oversampled with a technique like SMOTE often gives better results than training a supervised learning algorithm directly on the imbalanced data. We added K-Means cluster labels as an extra feature on top of SMOTE to help the model identify patterns, and this combination gave the best results of the approaches tried on this dataset.