philipp-sc / llm-fraud-detection

Robust semi-supervised spam detection using Rust native NLP pipelines.

License: Apache License 2.0

Dockerfile 0.81% Rust 97.44% Shell 0.24% Python 1.51%

llm-fraud-detection's Introduction


LLM Fraud Detection

Uses llama.cpp to generate text embeddings for a given input text; the embeddings are then used to predict the likelihood of fraud.
https://github.com/Philipp-Sc/llm-fraud-detection

Cosmos Rust Package

An API to query and broadcast transactions via gRPC. Makes direct use of cosmos-rust (cosmos‑sdk‑proto, cosmrs) and osmosis-rust (osmosis-std).
https://github.com/Philipp-Sc/cosmos-rust-package

Permutation feature importance

A Rust port of permutation feature importance to aid feature selection.
https://github.com/Philipp-Sc/importance

Other Contributions

https://github.com/whisperfish/presage
https://github.com/cosmos/cosmos-rust
https://github.com/MiscellaneousStuff/openai-whisper-cpu

llm-fraud-detection's Issues

add DAO governance proposals dataset

assess data quality: inspect entries where the prediction != label
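
A minimal sketch of such an inspection step, assuming each entry carries a text, a boolean label, and a predicted spam likelihood. The `Scored` struct, its field names, and the threshold parameter are hypothetical and not taken from the repository.

```rust
/// One labeled example together with the model's output.
/// The field names are hypothetical, not taken from the repository.
#[derive(Debug, Clone, PartialEq)]
pub struct Scored {
    pub text: String,
    pub label: bool,     // true = spam
    pub prediction: f32, // predicted spam likelihood in [0, 1]
}

/// Returns the entries whose thresholded prediction disagrees with the label,
/// sorted so the most confident mistakes come first.
pub fn mismatches(entries: &[Scored], threshold: f32) -> Vec<&Scored> {
    let mut wrong: Vec<&Scored> = entries
        .iter()
        .filter(|e| (e.prediction >= threshold) != e.label)
        .collect();
    // Distance from the threshold is a rough measure of how confident the model was.
    wrong.sort_by(|a, b| {
        let da = (a.prediction - threshold).abs();
        let db = (b.prediction - threshold).abs();
        db.partial_cmp(&da).unwrap_or(std::cmp::Ordering::Equal)
    });
    wrong
}
```

Entries at the top of the returned list are the ones where the model was most certain and still disagreed with the label, which makes them good candidates for a manual check for mislabeled data.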

add K-Means Clustering step before KNN

to reduce the KNN model size.
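
One way this could look, as a sketch only: cluster the embeddings of each class with k-means and let the KNN model store the labeled centroids instead of every sample. The function names, the per-class clustering, and the fixed iteration count are assumptions made for illustration, not the repository's design.

```rust
/// Squared Euclidean distance between two vectors of equal length.
fn dist2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Index of the centroid closest to `p`.
fn nearest(p: &[f32], centroids: &[Vec<f32>]) -> usize {
    let mut best = 0;
    for i in 1..centroids.len() {
        if dist2(p, &centroids[i]) < dist2(p, &centroids[best]) {
            best = i;
        }
    }
    best
}

/// Plain Lloyd's k-means. Centroids are seeded with the first `k` points
/// to keep the sketch deterministic; a real run would likely use k-means++.
pub fn kmeans(points: &[Vec<f32>], k: usize, iterations: usize) -> Vec<Vec<f32>> {
    assert!(k > 0 && k <= points.len());
    let dim = points[0].len();
    let mut centroids: Vec<Vec<f32>> = points[..k].to_vec();
    for _ in 0..iterations {
        let mut sums = vec![vec![0.0f32; dim]; k];
        let mut counts = vec![0usize; k];
        for p in points {
            let c = nearest(p, &centroids);
            counts[c] += 1;
            for (s, x) in sums[c].iter_mut().zip(p) {
                *s += x;
            }
        }
        for c in 0..k {
            // An empty cluster keeps its previous centroid.
            if counts[c] > 0 {
                centroids[c] = sums[c].iter().map(|s| s / counts[c] as f32).collect();
            }
        }
    }
    centroids
}

/// Cluster each class separately so every centroid keeps a clean label.
/// The KNN model then stores `k_per_class` points per class instead of all samples.
pub fn compress(spam: &[Vec<f32>], ham: &[Vec<f32>], k_per_class: usize) -> Vec<(Vec<f32>, bool)> {
    let mut reduced = Vec::new();
    for c in kmeans(spam, k_per_class, 20) {
        reduced.push((c, true));
    }
    for c in kmeans(ham, k_per_class, 20) {
        reduced.push((c, false));
    }
    reduced
}
```

Clustering per class avoids centroids that mix spam and ham, at the cost of choosing `k_per_class` by hand. The trade-off between model size and accuracy would need to be measured.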

use a Llama 2 based model to generate the embeddings

https://huggingface.co/AlekseyKorshuk/vicuna-7b

Improve engineered features for even better accuracy

The current engineered 'hard-coded' features are very basic. While they provide useful information, there is room for improvement.

src/build/feature_engineering/mod.rs

Instead of hard-coded conditions, create (or augment the existing features with) a bag-of-words vector derived from the training dataset.

For example, use a frequency encoding of common words that often occur in spam but not in ham, and vice versa.
This results in two vectors that together contain the most important/common words for and against a spam classification.
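
The idea could be sketched as follows, using only the standard library. Scoring words by the difference in document frequency between the two classes is an assumption chosen for simplicity; the function names are hypothetical and do not come from `src/build/feature_engineering/mod.rs`.

```rust
use std::collections::{HashMap, HashSet};

/// Lowercased alphanumeric tokens of a text.
fn tokens(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Fraction of documents in which each word appears.
fn doc_freq(docs: &[&str]) -> HashMap<String, f32> {
    let mut counts: HashMap<String, f32> = HashMap::new();
    for doc in docs {
        let unique: HashSet<String> = tokens(doc).into_iter().collect();
        for word in unique {
            *counts.entry(word).or_insert(0.0) += 1.0;
        }
    }
    let n = docs.len().max(1) as f32;
    counts.values_mut().for_each(|c| *c /= n);
    counts
}

/// The `top_n` words that are most over-represented in `a` relative to `b`.
pub fn vocabulary(a: &[&str], b: &[&str], top_n: usize) -> Vec<String> {
    let (fa, fb) = (doc_freq(a), doc_freq(b));
    let mut scored: Vec<(String, f32)> = fa
        .into_iter()
        .map(|(w, f)| {
            let d = f - fb.get(&w).copied().unwrap_or(0.0);
            (w, d)
        })
        .filter(|(_, d)| *d > 0.0)
        .collect();
    // Highest difference first; ties broken alphabetically for determinism.
    scored.sort_by(|x, y| y.1.partial_cmp(&x.1).unwrap().then_with(|| x.0.cmp(&y.0)));
    scored.into_iter().take(top_n).map(|(w, _)| w).collect()
}

/// Frequency encoding: how often each vocabulary word occurs in `text`.
pub fn encode(text: &str, vocab: &[String]) -> Vec<f32> {
    let toks = tokens(text);
    vocab
        .iter()
        .map(|w| toks.iter().filter(|t| *t == w).count() as f32)
        .collect()
}
```

Calling `vocabulary(spam, ham, n)` and `vocabulary(ham, spam, n)` gives the two word lists, and `encode` turns a text into one count vector per list. These vectors could then be concatenated with the existing features.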

Model Evaluation

Instead of testing the model performance on the same data it was trained on, split the data into a training set and a test set.

90% training data
10% test data

Make sure both sets contain spam and ham samples.
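
A stratified 90/10 split along these lines might look like the sketch below. It is deterministic (every tenth sample of each class goes to the test set) to stay free of external crates; the function name and the tuple layout `(item, is_spam)` are assumptions.

```rust
/// Deterministic stratified split: within each class, every tenth sample
/// goes to the test set, so both sets keep roughly the original spam/ham ratio.
/// Shuffle the data beforehand if its order is not already random.
pub fn stratified_split<T: Clone>(
    samples: &[(T, bool)], // (features or text, is_spam)
) -> (Vec<(T, bool)>, Vec<(T, bool)>) {
    let (mut train, mut test) = (Vec::new(), Vec::new());
    let mut seen = [0usize; 2]; // per-class counters: [ham, spam]
    for sample in samples {
        let class = sample.1 as usize;
        if seen[class] % 10 == 9 {
            test.push(sample.clone());
        } else {
            train.push(sample.clone());
        }
        seen[class] += 1;
    }
    (train, test)
}
```

Counting per class rather than over the whole dataset is what keeps a rare class such as spam represented in the test set.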

Use a Neural Network instead of the Random Forest Regressor

Replacing the Random Forest Regressor with a better-suited model (a Neural Network) should improve performance.

(A Random Forest cannot extrapolate beyond the range of target values seen during training, so it can have difficulty generalizing to unseen data.)

Re-generate topics and re-train fraud detection

Re-generate topics and re-train fraud detection with a bigger dataset of governance proposals.

governance_proposal_spam_ham.csv 
---------------
count spam: 172
count ham: 2551

Note: This should help reduce false positives, since the model has not yet seen many ham (or spam) examples of governance proposals.

Note: Consider reducing the ham dataset by filtering out some of the rejected proposals with a high share of votes against, to avoid training on likely spam as ham.
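
A sketch of such a filter, assuming each proposal record exposes its final state and vote tallies. The `Proposal` struct, its fields, and the `max_no_share` cut-off are hypothetical; the actual columns of governance_proposal_spam_ham.csv may differ.

```rust
/// Hypothetical shape of one governance proposal record.
#[derive(Debug, Clone)]
pub struct Proposal {
    pub id: u64,
    pub rejected: bool,
    pub yes_votes: f64,
    pub no_votes: f64, // includes veto votes in this sketch
}

/// Keeps a proposal as ham unless it was rejected and the share of votes
/// against it reaches `max_no_share` (an arbitrary cut-off to be tuned).
pub fn filter_ham(ham: &[Proposal], max_no_share: f64) -> Vec<Proposal> {
    ham.iter()
        .filter(|p| {
            let total = p.yes_votes + p.no_votes;
            let no_share = if total > 0.0 { p.no_votes / total } else { 0.0 };
            !(p.rejected && no_share >= max_no_share)
        })
        .cloned()
        .collect()
}
```

Proposals that passed are always kept, whatever their vote split, because only rejected proposals are suspected of being unlabeled spam here.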

Governance proposal spam_likelihood dataset improved

Re-train using the improved crypto governance proposal dataset.

The dataset now also considers the final state (rejected, passed, failed).
More proposals have been added.