
Detecting illicit transactions in the Ethereum Transactions dataset by increasing the dataset size, with little tolerance for missed fraudulent transactions.

Home Page: https://github.com/Saket03-P/Ethereum-Transactions-Fraud-Detection

Jupyter Notebook 100.00%
cost-sensitive-learning ctgan fraud-detection machine-learning


Cost Sensitive Approach to Ethereum Transactions Fraud Detection using Machine Learning

Ethereum Fraud Image

The existing dataset for Ethereum transaction fraud detection is available on Kaggle (Link to the Original Dataset), with 9841 samples spread across 51 features. The class distribution of the samples is Fraud : Legitimate = 1 : 4.

Why the dataset is the way it is :

  • In sectors such as finance and health care, data is always scarce because of user privacy concerns, so the datasets available for classification in these domains are small.
  • The number of cases with a particular condition in a health care or finance problem is also tiny compared to the number of unaffected samples. For example, at any point in time, the number of people infected by a virus in a population is usually a small fraction of the whole. This also explains why our dataset contains far fewer frauds than legitimate transactions.

This leads to the following issues :

  • With such a small dataset, our classification models tend to overfit the training set and fail to capture the underlying causes of fraud in the ecosystem.
    • On top of the dataset already being small, outliers and incomplete samples shrink it further during data preprocessing.
  • When our models are exposed to such a skewed dataset, they are trained on very few fraudulent transactions, which makes it hard for them to detect frauds in a real-time system because there are few patterns to learn from.
    • This may lead to some fraudulent transactions being misclassified as legitimate, so they persist in the ecosystem and adversely affect innocent transactors.
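The effect of imbalance on evaluation can be illustrated with a small arithmetic sketch. It assumes a hypothetical sample of 1000 transactions at the 1 : 4 ratio quoted above and a trivial baseline that labels every transaction legitimate; the sample size and the baseline are illustrative assumptions, not part of the project.

```python
# Hypothetical sample at the 1 : 4 Fraud : Legitimate ratio (sizes are illustrative).
n_fraud, n_legit = 200, 800

# Trivial baseline: predict "legitimate" (0) for every transaction.
y_true = [1] * n_fraud + [0] * n_legit
y_pred = [0] * len(y_true)

# Count true positives, false negatives, and correct predictions.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
correct = sum(t == p for t, p in zip(y_true, y_pred))

accuracy = correct / len(y_true)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
# prints: accuracy=0.80 recall=0.00
```

The baseline reaches 80% accuracy while catching no fraud at all, which is why recall on the fraud class matters more than accuracy here.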

Our methodology to fix the issues :

  • We generate new synthetic data samples from the existing dataset using the CTGAN model (CTGANSynthesizer Model), which is well suited to generating samples that are statistically representative of the original tabular dataset.
  • We then apply cost-sensitive learning to our classification models in order to minimize the misclassification of fraud samples. This is done by assigning a higher misclassification cost to missed fraudulent transactions, which tells the model which errors to prioritize when minimizing cost.
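The idea of assigning a higher cost to a missed fraud can be sketched as a minimum-expected-cost decision rule. This is only one way to realize cost-sensitive learning and is not necessarily how the notebook does it (classifiers can also be trained with per-class weights). The function names, the example probabilities, and the costs are hypothetical; the cost of 3 for a missed fraud is an assumption borrowed from the "(= 3)" in the results caption.

```python
def min_cost_label(p_fraud, cost_fn=3.0, cost_fp=1.0):
    """Return 1 (fraud) or 0 (legitimate), whichever has the lower expected cost.

    cost_fn: cost of missing a fraud (false negative).
    cost_fp: cost of flagging a legitimate transaction (false positive).
    Both default values are illustrative assumptions.
    """
    expected_cost_if_legit = p_fraud * cost_fn          # risk of a false negative
    expected_cost_if_fraud = (1.0 - p_fraud) * cost_fp  # risk of a false positive
    return 1 if expected_cost_if_fraud < expected_cost_if_legit else 0


def cost_threshold(cost_fn=3.0, cost_fp=1.0):
    """Fraud probability above which flagging is the cheaper decision."""
    return cost_fp / (cost_fp + cost_fn)


# Hypothetical fraud probabilities produced by some classifier.
probs = [0.10, 0.30, 0.60]

labels_plain = [1 if p > 0.5 else 0 for p in probs]  # cost-blind 0.5 cut-off
labels_cost = [min_cost_label(p) for p in probs]     # cost-aware decision

print(cost_threshold())  # 0.25
print(labels_plain)      # [0, 0, 1]
print(labels_cost)       # [0, 1, 1]
```

With a 3 : 1 cost ratio the decision threshold drops from 0.5 to 0.25, so the borderline transaction at 0.30 is flagged instead of being let through.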

Tasks Accomplished :

  • Created an aggregated dataset by doubling the original dataset with synthetic data samples, which score 85.63% on similarity (quality) with the existing dataset (Aggregated Dataset available here).
  • Incorporating cost-sensitive learning makes our models better suited to real-time detection systems, which can afford to flag a legitimate transaction as fraudulent, since this can be verified with the transactor, but cannot afford to let a fraudulent transaction go undetected and harm the ecosystem.

Characteristics of Aggregated Dataset

Overall Distribution of the FLAG Column

CTGAN Loss Function for the Dataset

Columns Similarity between Synthetic & Original Data


Comparison of Classification Metrics in the absence / in the presence of Cost Sensitive Learning

| Model | Accuracy | Precision | Recall |
|---|---|---|---|
| Decision Tree Classifier | 0.9805 | 0.9827 | 0.9309 |
| Random Forest Classifier | 0.9837 | 0.9898 | 0.938 |
| AdaBoost Classifier | 0.9798 | 0.9673 | 0.9435 |
| Light Gradient Boosting Machine | 0.9881 | 0.987 | 0.9606 |
| Extreme Gradient Boosting Machine | 0.99 | 0.9857 | 0.9703 |

Evaluation Metrics of Models without Cost Sensitive Learning

| Model | Accuracy | Precision | Recall |
|---|---|---|---|
| Decision Tree Classifier | 0.9781 | 0.9524 | 0.9517 |
| Random Forest Classifier | 0.9854 | 0.986 | 0.9494 |
| AdaBoost Classifier | 0.977 | 0.9416 | 0.9584 |
| Light Gradient Boosting Machine | 0.9897 | 0.9842 | 0.9703 |
| Extreme Gradient Boosting Machine | 0.9907 | 0.9842 | 0.9747 |

Evaluation Metrics of Models using Cost Sensitive Learning (= 3)
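To read the two sets of results side by side, the short sketch below computes the change in precision and recall per model. The numbers are copied from the reported metrics above; the variable names are only illustrative.

```python
# (accuracy, precision, recall) as reported without cost-sensitive learning.
without_csl = {
    "Decision Tree Classifier": (0.9805, 0.9827, 0.9309),
    "Random Forest Classifier": (0.9837, 0.9898, 0.938),
    "AdaBoost Classifier": (0.9798, 0.9673, 0.9435),
    "Light Gradient Boosting Machine": (0.9881, 0.987, 0.9606),
    "Extreme Gradient Boosting Machine": (0.99, 0.9857, 0.9703),
}

# (accuracy, precision, recall) as reported with cost-sensitive learning.
with_csl = {
    "Decision Tree Classifier": (0.9781, 0.9524, 0.9517),
    "Random Forest Classifier": (0.9854, 0.986, 0.9494),
    "AdaBoost Classifier": (0.977, 0.9416, 0.9584),
    "Light Gradient Boosting Machine": (0.9897, 0.9842, 0.9703),
    "Extreme Gradient Boosting Machine": (0.9907, 0.9842, 0.9747),
}

# Change (with minus without) in precision and recall for each model.
precision_delta = {
    m: round(with_csl[m][1] - without_csl[m][1], 4) for m in without_csl
}
recall_delta = {
    m: round(with_csl[m][2] - without_csl[m][2], 4) for m in without_csl
}

for model in without_csl:
    print(
        f"{model:<36} precision {precision_delta[model]:+.4f}"
        f"  recall {recall_delta[model]:+.4f}"
    )
```

On these reported figures, recall rises and precision falls for all five models, which is the trade-off cost-sensitive learning is meant to make: fewer missed frauds at the price of more legitimate transactions being flagged.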
